
2025-05-28-12-07

xChemAgents: Agentic AI for Explainable Quantum Chemistry

Abstract

arXiv:2505.20574v1 Announce Type: new Abstract: Recent progress in multimodal graph neural networks has demonstrated that augmenting atomic XYZ geometries with textual chemical descriptors can enhance predictive accuracy across a range of electronic and thermodynamic properties. However, naively appending large sets of heterogeneous descriptors often degrades performance on tasks sensitive to molecular shape or symmetry, and undermines interpretability. xChemAgents proposes a cooperative agent framework that injects physics-aware reasoning into multimodal property prediction. xChemAgents comprises two language-model-based agents: a Selector, which adaptively identifies a sparse, weighted subset of descriptors relevant to each target, and provides a natural language rationale; and a Validator, which enforces physical constraints such as unit consistency and scaling laws through iterative dialogue. On standard benchmark datasets, xChemAgents achieves up to a 22% reduction in mean absolute error over strong baselines, while producing faithful, human-interpretable explanations. Experiment results highlight the potential of cooperative, self-verifying agents to enhance both accuracy and transparency in foundation-model-driven materials science. The implementation and accompanying dataset are available anonymously at https://github.com/KurbanIntelligenceLab/xChemAgents.

摘要

多模态图神经网络的最新进展表明,通过将原子XYZ几何结构与文本化学描述符相结合,可以提高对多种电子和热力学性质的预测准确性。然而,简单地附加大量异构描述符往往会降低对分子形状或对称性敏感任务的性能,并损害可解释性。xChemAgents提出了一种协作代理框架,将物理感知推理注入多模态性质预测中。xChemAgents包含两个基于语言模型的代理:选择器(Selector)自适应地识别与每个目标相关的稀疏加权描述符子集,并提供自然语言依据;验证器(Validator)通过迭代对话强制执行物理约束,如单位一致性和标度律。在标准基准数据集上,xChemAgents相较于强基线实现了高达22%的平均绝对误差降低,同时生成忠实、人类可解释的说明。实验结果凸显了协作自验证代理在提升基础模型驱动材料科学的准确性和透明度方面的潜力。实现代码及配套数据集可通过匿名链接https://github.com/KurbanIntelligenceLab/xChemAgents获取。


Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System

Abstract

arXiv:2505.20310v1 Announce Type: new Abstract: Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. However, while LLM-based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and data extraction. In this paper, we propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls. The hybrid review, hierarchical extraction, self-proving, and feedback checking strategies implemented in Manalyzer significantly alleviate these two hallucinations. To comprehensively evaluate the performance of meta-analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline in multi meta-analysis tasks. Project page: https://black-yt.github.io/meta-analysis-page/ .

摘要

元分析是一种系统性研究方法,通过整合多个现有研究的数据以得出综合结论。这种方法不仅能减轻单个研究固有的局限性,还能通过集成数据分析促进新发现。传统元分析涉及文献检索、论文筛选和数据提取等复杂多阶段流程,需要耗费大量人力与时间。尽管基于大语言模型的方法能加速某些环节,但仍面临重大挑战,例如论文筛选和数据提取中的幻觉问题。本文提出多智能体系统Manalyzer,通过工具调用实现端到端自动化元分析。该系统采用的混合评审、分层提取、自证与反馈校验策略显著缓解了上述两类幻觉问题。为全面评估元分析性能,我们构建了包含3个领域(文本、图像和表格模态)729篇论文的新基准数据集,涵盖超10,000个数据点。大量实验表明,Manalyzer在多类元分析任务中较基线大语言模型实现了显著性能提升。项目页面:https://black-yt.github.io/meta-analysis-page/。


Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting

Abstract

arXiv:2505.20521v1 Announce Type: new Abstract: This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.

摘要

本文介绍了莱利项目(Project Riley),一种新型多模态多模型对话式人工智能架构,旨在模拟受情绪状态影响的推理过程。受皮克斯电影《头脑特工队》启发,该系统由五个独立的情感代理(快乐、悲伤、恐惧、愤怒和厌恶)组成,这些代理通过结构化多轮对话进行回答生成、批评和迭代优化。最终推理机制将这些代理的贡献综合为连贯输出,既可体现主导情绪,也能整合多元观点。该架构整合了文本与视觉大语言模型(LLMs),并采用先进的推理与自我优化流程。我们在离线环境中部署了功能原型,针对情感表达能力和计算效率进行了优化。基于该原型衍生出应急场景专用版本Armando,通过检索增强生成(RAG)和累积上下文追踪技术,提供情绪适配且事实准确的信息。莱利项目原型通过用户测试进行评估,参与者与聊天机器人交互后完成结构化问卷,从三个维度进行测评:情绪适配性、清晰度与实用性、自然度与拟人性。结果显示在结构化场景中表现优异,尤其在情绪匹配和沟通清晰度方面。


SCAR: Shapley Credit Assignment for More Efficient RLHF

Abstract

arXiv:2505.20417v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models (LLMs) with human preferences, yet it often suffers from sparse reward signals, making effective credit assignment challenging. In typical setups, the reward model provides a single scalar score for an entire generated sequence, offering little insight into which token or span-level decisions were responsible for the outcome. To address this, we propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values in cooperative game theory. SCAR distributes the total sequence-level reward among constituent tokens or text spans based on their principled marginal contributions. This creates dense reward signals, crucially, without necessitating the training of auxiliary critique models or recourse to fine-grained human annotations at intermediate generation stages. Unlike prior dense reward methods, SCAR offers a game-theoretic foundation for fair credit attribution. Theoretically, we demonstrate that SCAR preserves the original optimal policy, and empirically, across diverse tasks including sentiment control, text summarization, and instruction tuning, we show that SCAR converges significantly faster and achieves higher final reward scores compared to standard RLHF and attention-based dense reward baselines. Our findings suggest that SCAR provides a more effective and theoretically sound method for credit assignment in RLHF, leading to more efficient alignment of LLMs.

摘要

基于人类反馈的强化学习(RLHF)是一种广泛使用的技术,用于将大型语言模型(LLMs)与人类偏好对齐,但其常受稀疏奖励信号的困扰,导致有效的信用分配具有挑战性。在典型设置中,奖励模型仅为整个生成序列提供单一标量分数,难以揭示哪些词元或片段级决策对结果产生了影响。为解决这一问题,我们提出夏普利信用分配奖励(SCAR),这是一种利用合作博弈论中夏普利值的新方法。SCAR基于各成分词元或文本片段的边际贡献,将序列级总奖励按原则性分配。这一方法创造了密集的奖励信号,且关键无需训练辅助评论模型或依赖中间生成阶段的细粒度人工标注。与现有密集奖励方法不同,SCAR为公平信用归因提供了博弈论基础。理论上,我们证明SCAR保留了原始最优策略;实证上,在情感控制、文本摘要和指令调优等多样化任务中,相较于标准RLHF和基于注意力的密集奖励基线,SCAR收敛速度显著更快且最终奖励分数更高。我们的研究结果表明,SCAR为RLHF中的信用分配提供了一种更有效且理论可靠的方法,从而实现了LLMs更高效的对齐。
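
The idea of splitting one sequence-level reward across spans by marginal contribution can be illustrated with a small sketch. This is not the authors' implementation: `shapley_credits`, the three example spans, and `toy_reward` are hypothetical, and the exact enumeration over all orderings is only feasible for a handful of spans (a practical method would need some approximation). The sketch assumes the reward of the empty set is zero, so the credits sum to the full-sequence reward.

```python
from itertools import permutations
from math import factorial


def shapley_credits(num_spans, reward_fn):
    """Exact Shapley value of each span for a set-valued reward function.

    reward_fn takes a frozenset of span indices and returns a float.
    """
    credits = [0.0] * num_spans
    for order in permutations(range(num_spans)):
        included = set()
        prev = reward_fn(frozenset(included))
        for i in order:
            included.add(i)
            cur = reward_fn(frozenset(included))
            credits[i] += cur - prev  # marginal contribution of span i in this ordering
            prev = cur
    return [c / factorial(num_spans) for c in credits]


# Hypothetical spans and a toy stand-in for a reward model (assumption, not from the paper).
spans = ["The movie", "was great", "and fun"]


def toy_reward(subset):
    score = 0.0
    if 1 in subset:
        score += 1.0
    if 2 in subset:
        score += 0.5
    if 1 in subset and 2 in subset:
        score += 0.5  # interaction between the two sentiment-bearing spans
    return score


credits = shapley_credits(len(spans), toy_reward)
total = toy_reward(frozenset(range(len(spans))))
```

In this toy case the neutral first span receives no credit, and the per-span credits add up to the scalar reward of the whole sequence, which is the property that turns a sparse reward into a dense one without changing its total.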


Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models

Abstract

arXiv:2505.20522v1 Announce Type: new Abstract: Large reasoning models (LRMs) have exhibited the capacity of enhancing reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling Pareto of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.

摘要

大型推理模型(LRMs)已展现出通过内部测试时扩展提升推理性能的能力。基于此,进一步扩展测试时计算以释放更强推理能力成为具有前景的研究方向。然而,随着扩展边界的不断推进,系统理解实践极限并实现最优资源配置成为关键挑战。本文研究了测试时扩展的帕累托边界,并提出测试时扩展性能模型(TTSPM)。我们从概率建模角度理论分析了两种基本扩展范式——并行扩展与序列扩展。主要贡献在于推导出两种策略在扩展预算上的饱和点,确定了超出该阈值后额外计算将产生收益递减的临界值。值得注意的是,尽管机制不同,这两种范式在其上界处收敛于统一的数学结构。我们在AIME、MATH-500和GPQA等具有挑战性的推理基准上实证验证了理论发现,证明了这些边界对测试时资源分配的实际效用。本研究希望为测试时扩展的成本效益权衡提供见解,指导开发更具资源效率的大型推理模型推断策略。
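
To make the notion of a saturation point concrete, here is a minimal sketch under a simplifying assumption that is ours, not the paper's: each parallel sample solves the problem independently with probability `p`, and a correct sample is always recognised. Under that assumption the success probability with N samples is 1 - (1 - p)^N, and the gain from one extra sample is p(1 - p)^N. The function names and the threshold `eps` are hypothetical; the TTSPM derivation in the paper may differ.

```python
import math


def success_prob(p, n):
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def saturation_budget(p, eps):
    """Smallest n at which one more sample adds at most eps to success_prob."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    # Marginal gain p * (1 - p)**n <= eps  <=>  n >= log(eps / p) / log(1 - p)
    return max(0, math.ceil(math.log(eps / p) / math.log(1.0 - p)))


n_star = saturation_budget(p=0.3, eps=0.01)
gain_at_n_star = success_prob(0.3, n_star + 1) - success_prob(0.3, n_star)
```

Beyond `n_star`, each additional sample buys less than the chosen threshold, which is the kind of cut-off a resource-allocation rule could use.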


CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models

Abstract

arXiv:2505.20642v1 Announce Type: new Abstract: Personalized programming tutoring, such as exercise recommendation, can enhance learners' efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose an LLM-based agent, CoderAgent, to simulate students' programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.

摘要

个性化编程辅导(如习题推荐)能够提升学习者的效率、动机和成果,这在现代数字教育中日益重要。然而,缺乏充足且高质量的编程数据,加之离线评估与实际学习场景的脱节,阻碍了此类系统的实际部署。为解决这一挑战,现有方法多尝试模拟学习者练习数据,却往往忽视编程学习细粒度、迭代式的本质,导致可解释性与精细度不足。为此,我们提出基于大语言模型的智能体CoderAgent,在不依赖真实数据的前提下细粒度模拟学生编程过程。具体而言,我们为每位人类学习者配备智能代理,其核心在于捕捉人类编程实践过程中的认知状态。受认知架构框架ACT-R启发,我们通过聚焦编程知识掌握与编码能力应用,设计CoderAgent结构以匹配人类认知架构。针对多层认知推理的固有规律,我们提出编程思维树(PTOT),将过程分解为'为何、如何、何处、何为'四个步骤,实现对迭代式问题解决策略的细粒度解析。最终,真实数据集上的实验评估表明,CoderAgent能为学习轨迹提供可解释的洞察,并实现精准模拟,为个性化编程教育铺平道路。


MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning

Abstract

arXiv:2505.20670v1 Announce Type: new Abstract: Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi-agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post-action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessarily complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR's superior performance, achieving state-of-the-art results compared to existing approaches.

摘要

涉及工具整合的复杂任务对大型语言模型(LLM)提出了重大挑战,这促使多智能体工作流成为一种有前景的解决方案。反思已成为纠正智能体工作流中错误轨迹的有效策略。然而,现有方法仅在行动后阶段利用这种能力,即智能体观察执行结果。我们认为,与人类类似,LLM也可以在行动执行前进行反思:智能体能够预见到自身决策可能产生的不良后果,这不仅为评估决策提供了必要的补充视角,还能防止错误在轨迹中传播。本文提出MIRROR框架,包含执行前批判性评估预期行动的内部反思(intra-reflection)和基于观察进一步调整轨迹的交互反思(inter-reflection)。这一设计系统性地利用LLM的反思能力,在更全面的范围内消除和纠正错误行动。在StableToolBench和TravelPlanner基准测试上的评估表明,MIRROR性能优越,相较于现有方法取得了最先进的结果。


LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation

Abstract

arXiv:2505.20671v1 Announce Type: new Abstract: While reinforcement learning (RL) has achieved notable success in various domains, training effective policies for complex tasks remains challenging. Agents often converge to local optima and fail to maximize long-term rewards. Existing approaches to mitigate training bottlenecks typically fall into two categories: (i) Automated policy refinement, which identifies critical states from past trajectories to guide policy updates, but suffers from costly and uncertain model training; and (ii) Human-in-the-loop refinement, where human feedback is used to correct agent behavior, but this does not scale well to environments with large or continuous action spaces. In this work, we design a large language model-guided policy modulation framework that leverages LLMs to improve RL training without additional model training or human intervention. We first prompt an LLM to identify critical states from a sub-optimal agent's trajectories. Based on these states, the LLM then provides action suggestions and assigns implicit rewards to guide policy refinement. Experiments across standard RL benchmarks demonstrate that our method outperforms state-of-the-art baselines, highlighting the effectiveness of LLM-based explanations in addressing RL training bottlenecks.

摘要

尽管强化学习(RL)在多个领域取得了显著成功,但针对复杂任务训练有效策略仍具挑战性。智能体常收敛于局部最优而无法最大化长期奖励。现有缓解训练瓶颈的方法主要分为两类:(1)自动化策略优化,通过从历史轨迹中识别关键状态来指导策略更新,但存在模型训练成本高且效果不确定的问题;(2)人机协同优化,利用人类反馈修正智能体行为,但难以扩展至动作空间庞大或连续的环境。本研究设计了一个大语言模型引导的策略调制框架,利用LLM改进RL训练而无需额外模型训练或人工干预。我们首先提示LLM从次优智能体的轨迹中识别关键状态,随后基于这些状态由LLM提供动作建议并分配隐式奖励以指导策略优化。标准RL基准测试表明,本方法优于现有最优基线,凸显了基于LLM的解释在解决RL训练瓶颈中的有效性。


Reinforcement Speculative Decoding for Fast Ranking

Abstract

arXiv:2505.20316v1 Announce Type: new Abstract: Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore the single (first) token decoding for ranking approximation, but they suffer from severe degradation in tail positions. Although speculative decoding (SD) methods can be a remedy with verification at different positions, they face challenges in ranking systems due to their left-to-right decoding paradigm. Firstly, ranking systems require strict latency constraints, but verification rounds in SD methods remain agnostic; Secondly, SD methods usually discard listwise ranking knowledge about unaccepted items in previous rounds, hindering future multi-token prediction, especially when candidate tokens are the unaccepted items. In this paper, we propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs. To meet the ranking systems' latency requirement, we propose an up-to-down decoding paradigm that employs an agent to iteratively modify the ranking sequence under a constrained budget. Specifically, we design a ranking-tailored policy optimization, actively exploring optimal multi-round ranking modification policy verified by LLMs via reinforcement learning (RL). To better approximate the target LLM under the constrained budget, we trigger the agent fully utilizing the listwise ranking knowledge about all items verified by LLMs across different rounds in RL, enhancing the modification policy of the agent. More importantly, we demonstrate the theoretical robustness and advantages of our paradigm and implementation. Experiments on both IR and RS tasks show the effectiveness of our proposed method.

摘要

大型语言模型(LLMs)已广泛应用于信息检索(IR)系统和推荐系统(RS)等排序系统中。为缓解自回归解码的延迟问题,现有研究探索采用首单令牌解码进行排序近似,但此类方法在尾部位置存在显著性能退化。虽然推测式解码(SD)方法可通过多位置验证缓解该问题,但其从左至右的解码范式在排序系统中面临挑战:首先,排序系统要求严格的延迟约束,而SD方法的验证轮次具有不可预知性;其次,SD方法通常会丢弃先前轮次中未通过验证项目的列表排序知识,这阻碍了后续多令牌预测,尤其当候选令牌为先前未通过验证项目时。本文提出一种基于强化学习的推测式解码方法,用于LLMs的快速排序推理。为满足排序系统的延迟要求,我们采用自顶向下解码范式,通过智能体在预算约束下迭代修改排序序列。具体而言,我们设计了面向排序的策略优化方法,通过强化学习(RL)主动探索经LLMs验证的最优多轮排序修改策略。为在预算约束下更好逼近目标LLM,我们在RL训练中促使智能体充分利用LLMs跨轮次验证的所有项目的列表排序知识,从而提升其修改策略的有效性。更重要的是,我们从理论上证明了该范式及其实现的鲁棒性与优势。在IR和RS任务上的实验验证了所提方法的有效性。


Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients

Abstract

arXiv:2505.20609v1 Announce Type: new Abstract: Objective: To develop an LLM-based real-time compound diagnostic medical AI interface and to conduct a clinical trial comparing this interface with physicians on common internal medicine cases, using United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) style exams. Methods: A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included information available on initial diagnostic evaluation. The primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results: The accuracy of the physicians' first differential diagnosis ranged from 50% to 70%, whereas the real-time compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas the AI interface achieved an accuracy rate of 100%. The average time for the AI interface (557 sec) was 44.6% shorter than that of the physicians (1006 sec). The AI interface ($0.08) also reduced costs by 98.1% compared to the physicians' average ($4.2). Patient satisfaction scores ranged from 4.2 to 4.3 for care by physicians and were 3.9 for the AI interface. Conclusion: An LLM-based real-time compound diagnostic medical AI interface demonstrated diagnostic accuracy and patient satisfaction comparable to those of a physician, while requiring less time and lower costs. These findings suggest that AI interfaces may have the potential to assist primary care consultations for common internal medicine cases.

摘要

目的 开发基于大型语言模型(LLM)的实时复合诊断医疗AI接口,并通过与美国医师执照考试(USMLE)第二阶段临床技能(CS)考试相似的临床试验,比较该接口与医师对常见内科病例的诊断能力。方法 于2024年8月20日进行非随机临床试验。招募1名全科医师、2名内科住院医师(第2年和第3年)及5名模拟患者。临床案例改编自USMLE Step 2 CS考试。我们基于真实患者开发了10个代表性内科病例,包含初始诊断评估的可用信息。主要结局指标是第一鉴别诊断的准确率,重复性通过诊断一致性比例评估。结果 医师第一鉴别诊断准确率为50%-70%,而实时复合诊断医疗AI接口达到80%。第一诊断一致性比例为0.7。医师第一和第二鉴别诊断准确率为70%-90%,而AI接口达到100%。AI接口平均用时(557秒)较医师(1006秒)缩短44.6%。AI接口成本(0.08美元)较医师平均成本(4.2美元)降低98.1%。患者对医师诊疗满意度评分为4.2-4.3分,AI接口为3.9分。结论 基于LLM的实时复合诊断医疗AI接口展现出与医师相当的诊断准确率和患者满意度,同时具有用时更短、成本更低的优势。这些发现表明AI接口可能具备辅助常见内科病例初级诊疗的潜力。
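
The reported time and cost savings follow directly from the stated averages. The short check below recomputes them; the helper name `relative_reduction` is ours, and the inputs are the figures quoted in the abstract (557 s vs. 1006 s, and 0.08 vs. 4.2 US dollars).

```python
def relative_reduction(baseline, new):
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Figures as quoted in the abstract.
time_saving_pct = round(100 * relative_reduction(1006, 557), 1)  # seconds
cost_saving_pct = round(100 * relative_reduction(4.2, 0.08), 1)  # US dollars
```

Both values agree with the percentages stated in the abstract (44.6% and 98.1%).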


RRO: LLM Agent Optimization Through Rising Reward Trajectories

Abstract

arXiv:2505.20737v1 Announce Type: new Abstract: Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks, yet it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents are sensitive to the outcome of certain key steps, which makes them likely to fail the task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, an approach known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up with a large number of next action candidates, since they require extensive computation to acquire the training data through per-step trajectory exploration. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until identifying a step exhibiting positive reward differentials, i.e., rising rewards, relative to its preceding iteration. This method dynamically expands the search space for the next action candidates, efficiently capturing high-quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode-SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.

摘要

大型语言模型(LLMs)在多种任务中展现出卓越性能,但作为智能体解决复杂多步骤任务仍具挑战性。实践中,智能体对关键步骤结果高度敏感,规划轨迹中的细微错误极易导致任务失败。现有方法多采用强化学习校准推理过程,通过过程监督对每个推理步骤进行奖励或惩罚(即过程奖励模型PRMs)。然而,由于需通过逐步轨迹探索获取训练数据,PRMs在面临大量候选动作时难以扩展且计算成本高昂。为此,我们聚焦于连续推理步骤间的相对奖励趋势,提出在过程监督中保持收集轨迹的奖励递增,称为奖励上升优化(RRO)。具体而言,我们逐步增强过程监督,直至识别出相对于前次迭代呈现正奖励差异(即奖励上升)的步骤。该方法动态扩展候选动作的搜索空间,高效捕获高质量数据。我们在WebShop和InterCode-SQL基准测试中提供了数学依据和实证结果,表明所提RRO方法在显著降低探索成本的同时实现了更优性能。
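
The stopping rule described above (keep sampling next-action candidates only until one improves on the previous step's reward) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: `rising_reward_step`, `propose`, `score`, and `max_candidates` are hypothetical names, and the toy candidate list stands in for a policy and a reward estimator.

```python
def rising_reward_step(prev_reward, propose, score, max_candidates=8):
    """Sample candidates until one scores above prev_reward or the budget runs out.

    Returns (best_action, best_reward, number_of_candidates_tried).
    """
    if max_candidates < 1:
        raise ValueError("max_candidates must be at least 1")
    best_action, best_reward = None, float("-inf")
    tried = 0
    while tried < max_candidates:
        action = propose()
        reward = score(action)
        tried += 1
        if reward > best_reward:
            best_action, best_reward = action, reward
        if reward > prev_reward:  # rising reward found: stop exploring this step
            break
    return best_action, best_reward, tried


# Toy usage: candidates arrive in a fixed order with known rewards (assumption).
candidates = iter([("a", 0.2), ("b", 0.1), ("c", 0.6), ("d", 0.9)])
action, reward, tried = rising_reward_step(
    prev_reward=0.5,
    propose=lambda: next(candidates),
    score=lambda candidate: candidate[1],
)
```

Here exploration stops at the third candidate, so the fourth is never scored; the saving in exploration cost comes from this early exit compared with scoring a fixed, large candidate set at every step.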


GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning

Abstract

arXiv:2505.20672v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC) poses a stringent test of general AI capabilities, requiring solvers to infer abstract patterns from only a handful of examples. Despite substantial progress in deep learning, state-of-the-art models still achieve accuracy rates of merely 40-55% on the 2024 ARC Competition, indicative of a significant gap between their performance and human-level reasoning. In this work, we seek to bridge that gap by introducing an analogy-inspired ARC dataset, GIFARC. Leveraging large language models (LLMs) and vision-language models (VLMs), we synthesize new ARC-style tasks from a variety of GIF images that include analogies. Each new task is paired with a ground-truth analogy, providing an explicit mapping between visual transformations and everyday concepts. By embedding robust human-intuitive analogies into ARC-style tasks, GIFARC guides AI agents to evaluate the task analogically before engaging in brute-force pattern search, thus efficiently reducing problem complexity and building a more concise and human-understandable solution. We empirically validate that guiding LLMs with GIFARC's analogy-based approach shifts their task-solving strategies toward the analogical approach humans use.

摘要

抽象与推理语料库(ARC)对通用人工智能能力提出了严格测试,要求求解者仅通过少量示例推断抽象模式。尽管深度学习已取得显著进展,但在2024年ARC竞赛中,最先进模型的准确率仍仅为40-55%,这表明其性能与人类水平推理存在显著差距。本研究通过引入受类比启发的ARC数据集GIFARC来弥合这一差距。我们利用大语言模型(LLMs)和视觉语言模型(VLMs),从包含类比的各类GIF图像中合成新的ARC式任务。每个新任务均配有真实类比,提供视觉变换与日常概念间的显式映射。通过将强健的人类直觉类比嵌入ARC式任务,GIFARC引导智能体在展开暴力模式搜索前进行类比评估,从而有效降低问题复杂度并构建更简洁、人类可理解的解决方案。实证研究表明,采用GIFARC的类比方法引导LLMs会影响其任务解决方式,使其与人类的类比推理方法保持一致。


MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science

Abstract

arXiv:2505.20740v1 Announce Type: new Abstract: The rapid advancement of multimodal large language models (MLLMs) has unlocked new opportunities to tackle complex scientific challenges. Despite this progress, their application in addressing earth science problems, especially at the graduate level, remains underexplored. A significant barrier is the absence of benchmarks that capture the depth and contextual complexity of geoscientific reasoning. Current benchmarks often rely on synthetic datasets or simplistic figure-caption pairs, which do not adequately reflect the intricate reasoning and domain-specific insights required for real-world scientific applications. To address these gaps, we introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere, featuring over 7K figures with refined captions. These captions are crafted from the original figure captions and enriched with discussions and reasoning from the papers, ensuring the benchmark captures the nuanced reasoning and knowledge-intensive content essential for advanced scientific tasks. MSEarth supports a variety of tasks, including scientific figure captioning, multiple choice questions, and open-ended reasoning challenges. By bridging the gap in graduate-level benchmarks, MSEarth provides a scalable and high-fidelity resource to enhance the development and evaluation of MLLMs in scientific reasoning. The benchmark is publicly available to foster further research and innovation in this field. Resources related to this benchmark can be found at https://huggingface.co/MSEarth and https://github.com/xiangyu-mm/MSEarth.

摘要

多模态大语言模型(MLLMs)的快速发展为解决复杂科学问题提供了新机遇。然而,其在地球科学领域(尤其是研究生层面)的应用仍待探索,主要障碍在于缺乏能够体现地球科学推理深度与语境复杂性的基准测试。现有基准多依赖合成数据集或简化的图文配对,无法充分反映实际科学应用所需的复杂推理与领域洞见。为填补这一空白,我们推出MSEarth——一个基于高质量开放获取科学文献构建的多模态科学基准。该基准涵盖地球科学五大圈层(大气圈、冰冻圈、水圈、岩石圈和生物圈),包含7,000余幅配图及精炼标注。这些标注源自原始图注并融合论文中的讨论与推理,确保基准能捕捉高级科学任务所需的细微推理与知识密集型内容。MSEarth支持科学配图标注、多选题及开放式推理挑战等多种任务。通过弥补研究生级基准的空白,该资源为科学推理中MLLMs的开发与评估提供了可扩展的高保真工具。基准数据集已公开以促进相关研究创新,资源详见https://huggingface.co/MSEarth与https://github.com/xiangyu-mm/MSEarth。


MT-Mol: Multi Agent System with Tool-based Reasoning for Molecular Optimization

Abstract

arXiv:2505.20820v1 Announce Type: new Abstract: Large language models (LLMs) have great potential for molecular optimization, as they can gather external chemistry tools and enable collaborative interactions to iteratively refine molecular candidates. However, this potential remains underexplored, particularly in the context of structured reasoning, interpretability, and comprehensive tool-grounded molecular optimization. To address this gap, we introduce MT-Mol, a multi-agent framework for molecular optimization that leverages tool-guided reasoning and role-specialized LLM agents. Our system incorporates comprehensive RDKit tools, categorized into five distinct domains: structural descriptors, electronic and topological features, fragment-based functional groups, molecular representations, and miscellaneous chemical properties. Each category is managed by an expert analyst agent, responsible for extracting task-relevant tools and enabling interpretable, chemically grounded feedback. MT-Mol produces molecules with tool-aligned and stepwise reasoning through the interaction between the analyst agents, a molecule-generating scientist, a reasoning-output verifier, and a reviewer agent. As a result, our framework achieves state-of-the-art performance on 17 of the 23 tasks in the PMO-1K benchmark.

摘要

大语言模型(LLMs)在分子优化领域具有巨大潜力,因其能够整合外部化学工具并通过协同交互实现候选分子的迭代优化。然而,这种潜力尤其在结构化推理、可解释性以及基于工具的综合分子优化方面尚未得到充分探索。为填补这一空白,我们提出了MT-Mol——一个基于工具引导推理与角色专业化LLM代理的多智能体分子优化框架。该系统整合了全面的RDKit工具集,并将其划分为五个独立领域:结构描述符、电子与拓扑特征、基于片段的官能团、分子表示以及杂项化学性质。每个领域由专业分析代理负责,其任务是提取任务相关工具并提供可解释的、基于化学原理的反馈。MT-Mol通过分析代理、分子生成科学家、推理输出验证器和评审代理之间的交互,产生具有工具对齐和逐步推理特性的分子。实验结果表明,我们的框架在PMO-1K基准测试的23项任务中有17项达到了当前最优性能。


Can Agents Fix Agent Issues?

Abstract

arXiv:2505.20749v1 Announce Type: new Abstract: LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .

摘要

基于大语言模型的智能体系统正在成为一种新兴的软件范式,并已广泛应用于医疗、机器人和编程等多个领域。然而,维护这些系统需要大量投入,因为它们不可避免地存在缺陷,且需要持续演进以满足不断变化的外部需求。因此,自动解决智能体问题(即错误报告或功能需求)成为一项关键而具有挑战性的任务。尽管近期软件工程智能体(如SWE-agent)在解决传统软件系统问题方面展现出潜力,但其处理智能体系统中实际问题的有效性尚不明确,因为这类系统与传统软件存在显著差异。为填补这一空白,我们首先人工分析了201个真实场景中的智能体问题,识别出常见问题类别。随后投入500人时构建了AGENTISSUE-BENCH——一个包含50项智能体问题解决任务(每个任务均配备可执行环境及触发失败的测试用例)的可复现基准测试平台。我们进一步评估了当前最先进的软件工程智能体在该平台上的表现,发现其解决效率有限(仅3.33%-12.67%的解决率)。这些结果凸显了智能体系统维护相较于传统软件的独特挑战,表明需要进一步研究开发更先进的软件工程智能体来解决智能体问题。数据与代码详见https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/。


E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing

Abstract

arXiv:2505.20733v1 Announce Type: new Abstract: This paper presents an intelligent work automation approach in the context of contemporary digital transformation by integrating generative AI and Intelligent Document Processing (IDP) technologies with an Automation Agent to realize End-to-End (E2E) automation of corporate financial expense processing tasks. While traditional Robotic Process Automation (RPA) has proven effective for repetitive, rule-based simple task automation, it faces limitations in handling unstructured data, exception management, and complex decision-making. This study designs and implements a four-stage integrated process comprising automatic recognition of supporting documents such as receipts via OCR/IDP, item classification based on a policy-driven database, intelligent exception handling supported by generative AI (large language models, LLMs), and human-in-the-loop final decision-making with continuous system learning through an Automation Agent. Applied to a major Korean enterprise (Company S), the system demonstrated quantitative benefits including over 80% reduction in processing time for paper receipt expense tasks, decreased error rates, and improved compliance, as well as qualitative benefits such as enhanced accuracy and consistency, increased employee satisfaction, and data-driven decision support. Furthermore, the system embodies a virtuous cycle by learning from human judgments to progressively improve automatic exception handling capabilities. Empirically, this research confirms that the organic integration of generative AI, IDP, and Automation Agents effectively overcomes the limitations of conventional automation and enables E2E automation of complex corporate processes. The study also discusses potential extensions to other domains such as accounting, human resources, and procurement, and proposes future directions for AI-driven hyper-automation development.

摘要

本文提出了一种在当代数字化转型背景下的智能工作自动化方法,通过将生成式人工智能(AI)与智能文档处理(IDP)技术结合自动化代理(Automation Agent),实现企业财务费用处理任务的端到端(E2E)自动化。传统机器人流程自动化(RPA)虽在重复性、基于规则的简单任务自动化方面成效显著,但在处理非结构化数据、异常管理和复杂决策方面存在局限。本研究设计并实施了一个四阶段集成流程:通过OCR/IDP自动识别收据等证明文件、基于政策驱动数据库的项目分类、生成式AI(大语言模型LLM)支持的智能异常处理,以及人机协同最终决策与自动化代理持续学习的闭环系统。在韩国某大型企业(S公司)的应用表明,该系统实现了纸质收据费用任务处理时间减少80%以上、错误率降低与合规性提升等量化效益,以及准确性一致性增强、员工满意度提高和数据驱动决策支持等质性效益。系统通过从人工判断中学习,形成自动异常处理能力持续改进的良性循环。实证研究证实,生成式AI、IDP与自动化代理的有机结合能有效突破传统自动化局限,实现复杂企业流程的端到端自动化。研究还探讨了该方法在会计、人力资源和采购等领域的扩展潜力,并提出了AI驱动超自动化发展的未来方向。


MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems

Abstract

arXiv:2505.20824v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper we introduce MedSentry, a benchmark comprising 5,000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark-personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near-baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM-based multi-agent systems in medical domains.

摘要

随着大型语言模型(LLMs)在医疗领域日益广泛应用,确保其安全性——尤其在协作多智能体配置中——变得至关重要。本文提出MedSentry基准测试集,包含涵盖25个威胁类别、100个子主题的5000条对抗性医疗提示。结合该数据集,我们开发了端到端的攻防评估流程,系统分析四种代表性多智能体拓扑结构(层级式、共享池式、集中式和分布式)如何抵御"暗黑人格"智能体的攻击。研究发现这些架构在应对信息污染和保持稳健决策方面存在关键差异,暴露出其底层脆弱机制。例如,共享池式架构因开放信息共享而极易受攻击,而分布式架构凭借固有冗余和隔离展现出更强韧性。为降低风险,我们提出基于人格量表的检测校正机制,可识别并修复恶意智能体,使系统安全性恢复至接近基线水平。MedSentry不仅提供严谨的评估框架,还提出实用防御策略,为医疗领域基于LLM的多智能体系统设计更安全的方案。


Research on a Two-Layer Demand Response Framework for Electric Vehicle Users and Aggregators Based on LLMs

Abstract

arXiv:2505.20877v1 Announce Type: new Abstract: The widespread adoption of electric vehicles (EVs) has increased the importance of demand response in smart grids. This paper proposes a two-layer demand response optimization framework for EV users and aggregators, leveraging large language models (LLMs) to balance electricity supply and demand and optimize energy utilization during EV charging. The upper-layer model, focusing on the aggregator, aims to maximize profits by adjusting retail electricity prices. The lower-layer model targets EV users, using LLMs to simulate charging demands under varying electricity prices and optimize both costs and user comfort. The study employs a multi-threaded LLM decision generator to dynamically analyze user behavior, charging preferences, and psychological factors. The framework utilizes the PSO method to optimize electricity prices, ensuring user needs are met while increasing aggregator profits. Simulation results show that the proposed model improves EV charging efficiency, alleviates peak power loads, and stabilizes smart grid operations.

摘要

电动汽车(EV)的广泛普及提升了智能电网中需求响应的重要性。本文提出一种面向EV用户与聚合商的双层需求响应优化框架,利用大语言模型(LLM)平衡充电过程中的电力供需并优化能源利用。上层模型聚焦聚合商视角,通过调整零售电价实现利润最大化;下层模型针对EV用户,采用LLM模拟不同电价下的充电需求,优化成本与用户舒适度。研究通过多线程LLM决策生成器动态分析用户行为、充电偏好及心理因素,运用粒子群优化(PSO)方法进行电价优化,在满足用户需求的同时提升聚合商收益。仿真结果表明,该模型能有效提升EV充电效率、缓解电网峰值负荷并稳定智能电网运行。
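
The two-layer loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the LLM-simulated lower layer is replaced by a hypothetical linear demand curve (`simulated_demand`), the profit function and its wholesale cost are assumptions, and the PSO coefficients are conventional textbook values chosen for illustration.

```python
import random


def simulated_demand(price: float) -> float:
    """Hypothetical stand-in for the LLM-simulated lower layer:
    charging demand (kWh) falls linearly as the retail price rises."""
    return max(0.0, 100.0 - 40.0 * price)


def aggregator_profit(price: float, wholesale_cost: float = 0.5) -> float:
    """Upper-layer objective: margin per kWh times the demand users respond with."""
    return (price - wholesale_cost) * simulated_demand(price)


def pso_maximize(objective, lower, upper, n_particles=20, n_iters=100, seed=0):
    """One-dimensional particle swarm optimization (maximization)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive and social coefficients (assumed)
    pos = [rng.uniform(lower, upper) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]                          # personal best positions
    best_val = [objective(p) for p in pos]     # personal best values
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g], best_val[g]    # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (best_pos[i] - pos[i])
                      + c2 * r2 * (g_pos - pos[i]))
            # Keep the price inside the allowed range.
            pos[i] = min(upper, max(lower, pos[i] + vel[i]))
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i], val
                if val > g_val:
                    g_pos, g_val = pos[i], val
    return g_pos, g_val


best_price, best_profit = pso_maximize(aggregator_profit, 0.5, 2.5)
```

For this toy curve the profit is a concave quadratic with its maximum at a price of 1.5, so the swarm should settle close to that value. In the framework described above, each call to the objective would instead query the LLM decision generator for the users' response to a candidate price.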


Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving

Abstract

arXiv:2505.20869v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.

摘要

大语言模型(LLMs)在解决数学问题方面展现出强大能力,但其求解过程仍可能出现逻辑推理与计算错误。为此,本文提出MATH-VF框架,通过整合形式化转换器(Formalizer)与验证器(Critic)来实现对大语言模型生成解法的形式化验证。该框架首先利用基于LLM的形式化转换器将自然语言解法转换为形式化表述,随后由验证器(集成计算机代数系统、SMT求解器等外部工具)对形式化语境中的每个陈述进行正确性评估。当发现错误陈述时,验证器将生成修正反馈。我们通过实证研究验证MATH-VF在两种场景下的有效性:1)验证场景:用于判定给定问题解法的正确性;2)优化场景:当LLM解法生成器针对给定问题产生错误解时,将验证器提出的修正建议反馈至生成器以重新生成解法。我们在广泛使用的数学基准测试集MATH500和ProcessBench上评估本框架,实验结果证明该方法优于现有技术方案。
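
The step-wise checking idea can be sketched as follows. This is a toy illustration under stated assumptions, not MATH-VF itself: the "formal context" is reduced to a list of `lhs = rhs` arithmetic statements, the hypothetical `evaluate` function stands in for a Computer Algebra System call, and `critic` is an assumed name for the component that returns corrective feedback on the first incorrect statement.

```python
import ast
import operator
from fractions import Fraction

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr: str) -> Fraction:
    """Exactly evaluate a plain arithmetic expression (toy stand-in for a CAS)."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr.strip(), mode="eval").body)


def critic(steps):
    """Check each 'lhs = rhs' statement in order.

    Returns (index, feedback) for the first incorrect statement, or None
    when every statement holds."""
    for i, step in enumerate(steps):
        lhs, rhs = step.split("=")
        left, right = evaluate(lhs), evaluate(rhs)
        if left != right:
            return i, f"step {i}: {lhs.strip()} evaluates to {left}, not {right}"
    return None


solution = ["3 * (4 + 5) = 3 * 9", "3 * 9 = 28", "28 / 7 = 4"]
report = critic(solution)
```

In the refinement scenario, the feedback string in `report` is the kind of message that would be passed back to the solution generator so it can regenerate the solution from the faulty step.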


Agent-Environment Alignment via Automated Interface Generation

Abstract

arXiv:2505.21055v1 Announce Type: new Abstract: Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision-making tasks. These agents interact with the environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate perception and action. However, mismatches often arise between the agent's internal expectations about the effect of its actions and the actual state transitions in the environment, a phenomenon referred to as agent-environment misalignment. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface remains underexplored. In this work, we empirically demonstrate that agent-environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose ALIGN, an Auto-Aligned Interface Generation framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN-generated interface enhances both the static information of the environment and the step-wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves the alignment without modifying either the agent logic or the environment code. Experiments across multiple domains, including embodied tasks, web navigation, and tool use, show consistent performance improvements, with up to a 45.67% success rate improvement observed in ALFWorld. Meanwhile, the ALIGN-generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at https://github.com/THUNLP-MT/ALIGN.
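
The "lightweight wrapper" idea can be sketched in a few lines. Everything here is hypothetical and hand-written for illustration: `ToyEnv`, `InterfaceWrapper`, the rule text, and the hint are assumptions, whereas in ALIGN the interface is generated automatically. The sketch only shows how static information and step-wise observations can be enriched without touching the environment or agent code.

```python
class ToyEnv:
    """Hypothetical environment: 'take' silently fails unless the agent is at the table."""

    def __init__(self):
        self.location = "door"
        self.holding = None

    def step(self, action: str) -> str:
        verb, _, target = action.partition(" ")
        if verb == "goto":
            self.location = target
            return f"You are at the {target}."
        if verb == "take" and self.location == "table":
            self.holding = target
            return f"You take the {target}."
        return "Nothing happens."


class InterfaceWrapper:
    """Wraps an environment without modifying it.

    Adds static rules (describe) and explains uninformative observations (step)."""

    RULES = "Rule: you must 'goto table' before you can 'take' an object."

    def __init__(self, env):
        self.env = env

    def describe(self) -> str:
        # Static information the raw environment never states.
        return self.RULES

    def step(self, action: str) -> str:
        obs = self.env.step(action)
        if obs == "Nothing happens." and action.startswith("take "):
            # Enrich the step-wise observation with the reason for the failure.
            obs += (" Hint: 'take' only works at the table; "
                    f"you are at the {self.env.location}.")
        return obs


wrapped = InterfaceWrapper(ToyEnv())
first = wrapped.step("take mug")
wrapped.step("goto table")
second = wrapped.step("take mug")
```

Because the agent only ever calls `describe` and `step`, the same wrapper can sit in front of different agents, which mirrors the abstract's claim that the interface transfers across agent architectures.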


Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning

Abstract

arXiv:2505.21067v1 Announce Type: new Abstract: Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost the frequency of these behaviors.

摘要

强化学习(RL)在提升大语言模型(LLMs)的推理能力方面发挥了重要作用。现有研究直接将RL应用于较小规模的基础模型(称为零RL方法),也取得了显著进展。然而,本文研究表明,仅需920个样本,基于基础模型的简单蒸馏方法即可明显超越通常需要更多数据和计算成本的零RL方法。通过分析模型输出的词元频率,我们发现蒸馏模型展现出更灵活的推理能力:其使用拟人化词元和逻辑连接词的频率显著高于零RL模型。进一步分析表明,蒸馏方法增强了两种高级认知行为的出现频率:多视角思考/尝试以及元认知意识。这两种高级认知行为的频繁出现催生了灵活的推理能力,而这正是解决复杂推理问题的关键;而零RL方法则未能显著提升这些行为的出现频率。
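
The token-frequency comparison can be sketched as below. The word lists, the whitespace-level tokenization, and the two sample outputs are assumptions made for illustration; the abstract does not list the tokens or the tokenizer used in the paper.

```python
import re
from collections import Counter

# Hypothetical word lists; the abstract does not enumerate the actual tokens.
ANTHROPOMORPHIC = {"wait", "hmm", "okay"}
LOGICAL = {"but", "so", "because", "therefore", "however", "since"}


def rate_per_1000(text: str, vocabulary: set) -> float:
    """Occurrences of vocabulary words per 1,000 word-level tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in vocabulary)
    return 1000.0 * hits / len(tokens)


# Made-up sample outputs, not taken from any model.
distilled_output = ("Wait, let me check. Hmm, but since x is even, "
                    "therefore x squared is even.")
zero_rl_output = "x is even. x squared is even. The answer is even."

logical_distilled = rate_per_1000(distilled_output, LOGICAL)
logical_zero_rl = rate_per_1000(zero_rl_output, LOGICAL)
anthro_distilled = rate_per_1000(distilled_output, ANTHROPOMORPHIC)
```

Normalizing by output length matters here because distilled models tend to produce longer outputs, so raw counts alone would overstate the difference.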


Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation

Abstract

arXiv:2505.21106v1 Announce Type: new Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet they also exhibit notable social biases. These biases often manifest as unintended associations between neutral concepts and sensitive human attributes, leading to disparate model behaviors across demographic groups. While existing studies primarily focus on detecting and quantifying such biases, they offer limited insight into the underlying mechanisms within the models. To address this gap, we propose an explanatory framework that combines information flow analysis with multi-round dialogue evaluation, aiming to understand the origin of social bias from the perspective of imbalanced internal information utilization. Specifically, we first identify high-contribution image tokens involved in the model's reasoning process for neutral questions via information flow analysis. Then, we design a multi-turn dialogue mechanism to evaluate the extent to which these key tokens encode sensitive information. Extensive experiments reveal that LVLMs exhibit systematic disparities in information usage when processing images of different demographic groups, suggesting that social bias is deeply rooted in the model's internal reasoning dynamics. Furthermore, we complement our findings from a textual modality perspective, showing that the model's semantic representations already display biased proximity patterns, thereby offering a cross-modal explanation of bias formation.

摘要

尽管大规模视觉语言模型(LVLMs)在多模态任务中取得了显著进展,但它们也表现出明显的社会偏见。这些偏见通常表现为中性概念与敏感人类属性之间的非预期关联,导致模型在不同人口统计群体中表现出差异化的行为。现有研究主要集中于检测和量化此类偏见,但对模型内部潜在机制的解释较为有限。为填补这一空白,我们提出一个结合信息流分析与多轮对话评估的解释性框架,旨在从内部信息利用失衡的角度理解社会偏见的起源。具体而言,我们首先通过信息流分析识别模型在回答中性问题时推理过程中涉及的高贡献图像标记;随后设计多轮对话机制评估这些关键标记对敏感信息的编码程度。大量实验表明,LVLMs在处理不同人口群体图像时存在系统性的信息使用差异,表明社会偏见深植于模型的内部推理动态中。此外,我们从文本模态角度补充研究发现,证明模型的语义表征已呈现有偏见的邻近模式,从而为偏见形成提供了跨模态解释。


Large Language Model-enhanced Reinforcement Learning for Low-Altitude Economy Networking

Abstract

arXiv:2505.21045v1 Announce Type: new Abstract: Low-Altitude Economic Networking (LAENet) aims to support diverse flying applications below 1,000 meters by deploying various aerial vehicles for flexible and cost-effective aerial networking. However, complex decision-making, resource constraints, and environmental uncertainty pose significant challenges to the development of the LAENet. Reinforcement learning (RL) offers a potential solution in response to these challenges but has limitations in generalization, reward design, and model stability. The emergence of large language models (LLMs) offers new opportunities for RL to mitigate these limitations. In this paper, we first present a tutorial about integrating LLMs into RL by using the capacities of generation, contextual understanding, and structured reasoning of LLMs. We then propose an LLM-enhanced RL framework for the LAENet in terms of serving the LLM as information processor, reward designer, decision-maker, and generator. Moreover, we conduct a case study by using LLMs to design a reward function to improve the learning performance of RL in the LAENet. Finally, we provide a conclusion and discuss future work.

摘要

低空经济网络(LAENet)旨在通过部署各类飞行器,在1000米以下空域为多样化飞行应用提供灵活且经济高效的空中组网服务。然而,复杂决策制定、资源限制及环境不确定性对LAENet的发展构成重大挑战。强化学习(RL)为应对这些挑战提供了潜在方案,但在泛化性、奖励设计和模型稳定性方面存在局限。大语言模型(LLM)的出现为缓解这些局限提供了新机遇。本文首先利用LLM的生成能力、上下文理解与结构化推理能力,给出将LLM融入RL的教程;继而提出面向LAENet的LLM增强型RL框架,使LLM承担信息处理器、奖励设计器、决策者和生成器四种角色。此外,我们开展案例研究,利用LLM设计奖励函数以提升LAENet中RL的学习性能。最后总结研究结论并讨论未来工作。


Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs

Abstract

arXiv:2505.21419v1 Announce Type: new Abstract: Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.

摘要

当今云托管应用程序和服务是复杂系统,性能或功能不稳定可能由数十甚至数百种潜在根源引起。我们的假设是:通过将现代AI工具的模式匹配能力与自然多模态RAG大语言模型界面相结合,可以简化问题识别与解决过程。ARCA是一种面向该领域的新型多模态RAG大语言模型系统。阶段性评估表明,ARCA在性能上优于现有最先进方案。


The Multilingual Divide and Its Impact on Global AI Safety

Abstract

arXiv:2505.21344v1 Announce Type: new Abstract: Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the "language gap" in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to address these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.

摘要

尽管近年来大语言模型能力取得进展,但在全球少数主流语言之外,大多数语言的模型能力与安全性能仍存在显著差距。本文为研究人员、政策制定者和治理专家系统阐述了弥合人工智能"语言鸿沟"及降低多语言安全风险的关键挑战。我们深入分析了AI语言鸿沟存在并扩大的根源,及其如何导致全球AI安全领域的失衡发展。通过识别应对这些挑战的主要障碍,我们为政策与治理工作者提出建议:通过支持多语言数据集构建、提升透明度及加强相关研究,来应对由语言鸿沟引发的安全隐患。


Large Language Models Miss the Multi-Agent Mark

Abstract

arXiv:2505.21298v1 Announce Type: new Abstract: Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

摘要

近期对大型语言模型多智能体系统(MAS LLMs)的关注,促使越来越多框架利用多个LLM处理复杂任务。然而,现有研究大多套用了MAS术语却未涉及其基础原理。本立场论文揭示了MAS理论与当前MAS LLMs实现之间的关键差异,聚焦四个核心方面:智能体的社会属性、环境设计、协调与通信协议以及涌现行为的度量。我们认为多数MAS LLMs缺乏自主性、社会交互和结构化环境等多智能体特征,往往依赖过度简化的、以LLM为中心的架构。若反复探索MAS文献早已解决的问题,该领域的发展可能放缓并失去动力。为此,我们系统分析了这一问题并指出相关研究机遇,主张更好地整合成熟的MAS概念并采用更精确的术语,以避免错误表述和错失机遇。


Out of the Past: An AI-Enabled Pipeline for Traffic Simulation from Noisy, Multimodal Detector Data and Stakeholder Feedback

Abstract

arXiv:2505.21349v1 Announce Type: new Abstract: How can a traffic simulation be designed to faithfully reflect real-world traffic conditions? Past data-driven approaches to traffic simulation in the literature have relied on unrealistic or suboptimal heuristics. They also fail to adequately account for the effects of uncertainty and multimodality in the data on simulation outcomes. In this work, we integrate advances in AI to construct a three-step, end-to-end pipeline for generating a traffic simulation from detector data: computer vision for vehicle counting from camera footage, combinatorial optimization for vehicle route generation from multimodal data, and large language models for iterative simulation refinement from natural language feedback. Using a road network from Strongsville, Ohio as a testbed, we demonstrate that our pipeline can accurately capture the city's traffic patterns in a granular simulation. Beyond Strongsville, our traffic simulation framework can be generalized to other municipalities with different levels of data and infrastructure availability.

摘要

如何设计一个能真实反映现实世界交通状况的交通仿真系统?以往文献中基于数据驱动的交通仿真方法往往依赖于不现实或次优的启发式规则,且未能充分考虑数据中的不确定性和多模态特性对仿真结果的影响。本研究整合人工智能领域的最新进展,构建了一个从检测器数据生成交通仿真的三步骤端到端流程:利用计算机视觉技术从监控视频中提取车辆计数,通过组合优化方法从多模态数据生成车辆路径,并运用大语言模型根据自然语言反馈进行迭代仿真优化。以俄亥俄州斯特朗斯维尔的道路网络为测试平台,我们证明该流程能够通过精细化仿真准确捕捉该城市的交通模式。该交通仿真框架可进一步推广至具有不同数据水平和基础设施条件的其他城市区域。


MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Abstract

arXiv:2505.21327v1 Announce Type: new Abstract: Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.

摘要

逻辑推理是人类智能的核心要素,也是多模态大语言模型(MLLMs)的关键能力。尽管多模态推理研究取得了显著进展,但由于缺乏对逻辑推理类型的明确分类以及对推理本质的理解不足,现有基准测试难以全面评估模型的推理能力。为此,我们提出MME-Reasoning——一个全面评估MLLMs推理能力的基准测试,其问题涵盖归纳、演绎和溯因三类基本推理形式。我们通过严格的数据筛选确保每个问题都能有效评估推理能力而非感知技能或知识广度,并扩展评估协议以覆盖多样化问题的评测。实验表明,当对逻辑推理能力进行整体评估时,最先进的MLLMs仍存在显著局限:即便最先进的模型在综合逻辑推理中也表现有限,且不同推理类型间存在明显性能失衡。此外,我们对'思维模式'和基于规则的强化学习等常用推理增强方法进行了深入分析。这些发现揭示了当前MLLMs在多样化逻辑推理场景中的关键局限与性能失衡,为理解与评估推理能力提供了系统化的见解。


Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Abstract

arXiv:2505.21318v1 Announce Type: new Abstract: While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.

摘要

尽管具备思维链(CoT)推理能力的大语言模型(LLM)在数学和编程领域表现出色,但其在化学领域进行系统性推理的潜力尚未被发掘——该领域需要严格的分子结构分析以应对药物设计和反应工程等实际任务。现有基准测试主要关注简单知识检索,忽视了分子优化与反应预测等复杂任务所需的逐步推理能力。为此,我们提出ChemCoTBench推理框架,通过将分子结构理解与算术化操作(包括添加、删除和替换)相结合,将化学问题解决形式化为透明、分步骤的工作流程。该框架将分子转化视为模块化的"化学操作",支持慢思考推理模式,既遵循数学证明的逻辑,又将解决方案锚定于现实化学约束。我们在两个高影响力任务(分子性质优化与化学反应预测)上评估模型性能,这些任务既反映实际挑战又具备结构化可评估性。通过提供标注数据集、推理分类体系和基线评估结果,ChemCoTBench填补了抽象推理方法与实用化学发现之间的鸿沟,为推进LLM成为AI驱动科学创新的工具奠定基础。


Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework

Abstract

arXiv:2505.21291v1 Announce Type: new Abstract: In this paper, we present a novel diagnostic framework that integrates Knowledge Graphs (KGs) and Large Language Models (LLMs) to support system diagnostics in high-reliability systems such as nuclear power plants. Traditional diagnostic modeling struggles when systems become too complex, making functional modeling a more attractive approach. Our approach introduces a diagnostic framework grounded in the functional modeling principles of the Dynamic Master Logic (DML) model. It incorporates two coordinated LLM components, including an LLM-based workflow for automated construction of DML logic from system documentation and an LLM agent that facilitates interactive diagnostics. The generated logic is encoded into a structured KG, referred to as KG-DML, which supports hierarchical fault reasoning. Expert knowledge or operational data can also be incorporated to refine the model's precision and diagnostic depth. In the interaction phase, users submit natural language queries, which are interpreted by the LLM agent. The agent selects appropriate tools for structured reasoning, including upward and downward propagation across the KG-DML. Rather than embedding KG content into every prompt, the LLM agent distinguishes between diagnostic and interpretive tasks. For diagnostics, the agent selects and executes external tools that perform structured KG reasoning. For general queries, a Graph-based Retrieval-Augmented Generation (Graph-RAG) approach is used, retrieving relevant KG segments and embedding them into the prompt to generate natural explanations. A case study on an auxiliary feedwater system demonstrated the framework's effectiveness, with over 90% accuracy in key elements and consistent tool and argument extraction, supporting its use in safety-critical diagnostics.

摘要

本文提出了一种集成知识图谱(KGs)与大语言模型(LLMs)的新型诊断框架,用于支持核电站等高可靠性系统的故障诊断。当系统过于复杂时,传统诊断建模方法面临挑战,这使得功能建模成为更具吸引力的解决方案。我们的方法基于动态主逻辑(DML)模型的功能建模原理,构建了一个包含两个协同LLM组件的诊断框架:一个用于从系统文档自动构建DML逻辑的LLM工作流,以及一个支持交互式诊断的LLM智能体。生成的逻辑被编码为结构化知识图谱(KG-DML),支持分层故障推理。专家知识或运行数据可被纳入以提升模型精度和诊断深度。在交互阶段,用户提交自然语言查询,由LLM智能体解析后选择适当工具进行结构化推理(包括KG-DML的上下行传播)。该智能体区分诊断任务与解释任务:对于诊断任务,选择并执行外部工具进行结构化图谱推理;对于一般查询,采用基于图谱的检索增强生成(Graph-RAG)方法,检索相关图谱片段并嵌入提示词以生成自然语言解释。辅助给水系统的案例研究表明,该框架在关键要素上准确率超过90%,工具与参数提取结果稳定,验证了其在安全关键诊断中的适用性。
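
Upward and downward propagation over a hierarchical logic graph can be sketched as follows. The graph, node names, and AND/OR gate semantics are assumptions chosen for illustration; the abstract does not specify the DML encoding or the contents of the auxiliary feedwater case study.

```python
# Hypothetical functional hierarchy: node -> (gate, children).
# "AND" means the function needs all children; "OR" means any one child suffices.
GRAPH = {
    "deliver_feedwater": ("AND", ["water_source", "pumping"]),
    "pumping": ("OR", ["motor_pump", "turbine_pump"]),
}


def works(node: str, failed: set) -> bool:
    """Upward propagation: does `node` still function given the failed leaf components?"""
    if node not in GRAPH:          # leaf component
        return node not in failed
    gate, children = GRAPH[node]
    results = [works(child, failed) for child in children]
    return all(results) if gate == "AND" else any(results)


def leaves_under(node: str) -> list:
    """Downward propagation: leaf components that can influence `node`."""
    if node not in GRAPH:
        return [node]
    return [leaf for child in GRAPH[node][1] for leaf in leaves_under(child)]


one_pump_lost = works("deliver_feedwater", {"motor_pump"})
both_pumps_lost = works("deliver_feedwater", {"motor_pump", "turbine_pump"})
suspects = leaves_under("deliver_feedwater")
```

In the framework described above, functions like these would be the external tools the LLM agent selects for diagnostic queries, with the query's arguments (the node and the failed set) extracted from the user's natural language input.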


Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

Abstract

arXiv:2505.21418v1 Announce Type: new Abstract: Focused Ultrasound Ablation Surgery (FUAS) has emerged as a promising non-invasive therapeutic modality, valued for its safety and precision. Nevertheless, its clinical implementation entails intricate tasks such as multimodal image interpretation, personalized dose planning, and real-time intraoperative decision-making processes that demand intelligent assistance to improve efficiency and reliability. We introduce FUAS-Agents, an autonomous agent system that leverages the multimodal understanding and tool-using capabilities of large language models (LLMs). By integrating patient profiles and MRI data, FUAS-Agents orchestrates a suite of specialized medical AI tools, including segmentation, treatment dose prediction, and clinical guideline retrieval, to generate personalized treatment plans comprising MRI image, dose parameters, and therapeutic strategies. We evaluate the system in a uterine fibroid treatment scenario. Human assessment by four senior FUAS experts indicates that 82.5%, 82.5%, 87.5%, and 97.5% of the generated plans were rated 4 or above (on a 5-point scale) in terms of completeness, accuracy, fluency, and clinical compliance, respectively. These results demonstrate the potential of LLM-driven agents in enhancing decision-making across complex clinical workflows, and exemplify a translational paradigm that combines general-purpose models with specialized expert systems to solve practical challenges in vertical healthcare domains.

摘要

聚焦超声消融手术(FUAS)作为一种安全精准的无创治疗手段,已展现出显著临床应用前景。然而其实施过程涉及多模态影像解析、个性化剂量规划和实时术中决策等复杂任务,亟需智能辅助系统以提升效率与可靠性。本研究提出FUAS-Agents自主代理系统,通过整合大型语言模型(LLMs)的多模态理解与工具调用能力,协同患者资料与MRI数据,调度包括影像分割、治疗剂量预测和临床指南检索在内的专业医疗AI工具,生成涵盖MRI图像、剂量参数及治疗策略的个性化方案。在子宫肌瘤治疗场景的评估中,四位资深FUAS专家人工评审显示:生成方案在完整性、准确性、流畅性和临床合规性方面分别获得82.5%、82.5%、87.5%和97.5%的4分及以上评分(5分制)。该结果证实了LLM驱动代理在优化复杂临床决策流程方面的潜力,同时为通用模型与垂直领域专家系统的协同转化提供了范式,以解决医疗健康领域的实际挑战。


Policy Induction: Predicting Startup Success via Explainable Memory-Augmented In-Context Learning

Abstract

arXiv:2505.21427v1 Announce Type: new Abstract: Early-stage startup investment is a high-risk endeavor characterized by scarce data and uncertain outcomes. Traditional machine learning approaches often require large, labeled datasets and extensive fine-tuning, yet remain opaque and difficult for domain experts to interpret or improve. In this paper, we propose a transparent and data-efficient investment decision framework powered by memory-augmented large language models (LLMs) using in-context learning (ICL). Central to our method is a natural language policy embedded directly into the LLM prompt, enabling the model to apply explicit reasoning patterns and allowing human experts to easily interpret, audit, and iteratively refine the logic. We introduce a lightweight training process that combines few-shot learning with an in-context learning loop, enabling the LLM to update its decision policy iteratively based on structured feedback. With only minimal supervision and no gradient-based optimization, our system predicts startup success far more accurately than existing benchmarks. It is over 20x more precise than random chance, which succeeds 1.9% of the time. It is also 7.1x more precise than the typical 5.6% success rate of top-tier venture capital (VC) firms.

摘要

早期初创企业投资是一项高风险活动,其特点是数据稀缺且结果不确定。传统机器学习方法通常需要大量标注数据和精细调参,却仍缺乏透明度,难以让领域专家理解或改进。本文提出一种由记忆增强型大语言模型(LLM)驱动的透明高效投资决策框架,采用上下文学习(ICL)方法。我们的方法核心是将自然语言策略直接嵌入LLM提示中,使模型能够应用显式推理模式,并允许人类专家轻松解读、审核和迭代优化逻辑。我们引入一种轻量级训练流程,将小样本学习与上下文学习循环相结合,使LLM能够基于结构化反馈迭代更新决策策略。仅需极少量监督且无需基于梯度的优化,我们的系统对初创企业成功率的预测精度远超现有基准:其预测准确率比随机猜测(1.9%成功率)高出20倍以上,比顶级风险投资机构(VC)5.6%的平均成功率高出7.1倍。
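
The two multiples quoted in the abstract can be cross-checked with simple arithmetic. This is a derived figure, not one stated in the abstract: it assumes "7.1x more precise" means the system's precision is 7.1 times the 5.6% VC success rate.

```python
vc_success_rate = 0.056       # stated in the abstract
random_success_rate = 0.019   # stated in the abstract

# Assumption: "7.1x more precise" is a multiple of the VC success rate.
implied_precision = 7.1 * vc_success_rate

# The same precision expressed as a multiple of random chance.
ratio_vs_random = implied_precision / random_success_rate
```

Under that reading the implied precision is just under 40%, and its ratio to random chance falls between 20 and 21, which is consistent with the abstract's "over 20x" claim.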


Robust Hypothesis Generation: LLM-Automated Language Bias for Inductive Logic Programming

Abstract

arXiv:2505.21486v1 Announce Type: new Abstract: Automating robust hypothesis generation in open environments is pivotal for AI cognition. We introduce a novel framework integrating a multi-agent system, powered by Large Language Models (LLMs), with Inductive Logic Programming (ILP). Our system's LLM agents autonomously define a structured symbolic vocabulary (predicates) and relational templates, i.e., the language bias, directly from raw textual data. This automated symbolic grounding (the construction of the language bias), traditionally an expert-driven bottleneck for ILP, then guides the transformation of text into facts for an ILP solver, which inductively learns interpretable rules. This approach overcomes traditional ILP's reliance on predefined symbolic structures and the noise-sensitivity of pure LLM methods. Extensive experiments in diverse, challenging scenarios validate superior performance, paving a new path for automated, explainable, and verifiable hypothesis generation.

摘要

在开放环境中实现稳健假设生成的自动化对于人工智能认知至关重要。本研究提出了一种创新框架,将基于大语言模型(LLMs)的多智能体系统与归纳逻辑编程(ILP)相结合。我们的系统通过LLM智能体直接从原始文本数据中自主定义结构化符号词汇(谓词)和关系模板,即语言偏置。这种自动化的符号接地(语言偏置的构建)传统上是ILP专家驱动的瓶颈环节,随后指导文本转化为ILP求解器所需的事实,进而归纳学习可解释的规则。该方法克服了传统ILP对预定义符号结构的依赖,以及纯LLM方法对噪声敏感的缺陷。在多种复杂场景下开展的大量实验验证了其卓越性能,为自动化、可解释且可验证的假设生成开辟了新途径。


SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence

Abstract

arXiv:2505.12703v1 Announce Type: cross Abstract: We propose SpatialLLM, a novel approach advancing spatial intelligence tasks in complex urban scenes. Unlike previous methods requiring geographic analysis tools or domain expertise, SpatialLLM is a unified language model directly addressing various spatial intelligence tasks without any training, fine-tuning, or expert intervention. The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis. Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information and enable zero-shot execution of advanced spatial intelligence tasks, including urban planning, ecological analysis, traffic management, etc. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performances in urban analysis. We hope that SpatialLLM will provide a novel viable perspective for urban intelligent analysis and management. The code and dataset are available at https://github.com/WHU-USI3DV/SpatialLLM.

摘要

我们提出SpatialLLM这一创新方法,旨在推进复杂城市场景中的空间智能任务研究。与以往需要地理分析工具或领域专业知识的方法不同,SpatialLLM是一个统一的语言模型,无需任何训练、微调或专家干预即可直接处理各类空间智能任务。该模型的核心在于从原始空间数据构建详细且结构化的场景描述,从而驱动预训练大语言模型进行场景化分析。大量实验表明,通过我们的设计方案,预训练大语言模型能够准确感知空间分布信息,实现包括城市规划、生态分析、交通管理等高级空间智能任务的零样本执行。我们认为跨领域知识、上下文长度和推理能力是影响大语言模型城市分析性能的关键因素。希望SpatialLLM能为城市智能分析与管理提供全新的可行视角。代码与数据集详见https://github.com/WHU-USI3DV/SpatialLLM。


Large Language Model-Powered Decision Support for a Metal Additive Manufacturing Knowledge Graph

Abstract

arXiv:2505.20308v1 Announce Type: cross Abstract: Metal additive manufacturing (AM) involves complex interdependencies among processes, materials, feedstock, and post-processing steps. However, the underlying relationships and domain knowledge remain fragmented across literature and static databases that often demand expert-level queries, limiting their applicability in design and planning. To address these gaps, we develop a novel and queryable knowledge graph (KG) in Neo4j, encoding 53 distinct metals and alloys across seven material families, nine AM processes, four feedstock types, and associated post-processing requirements. A large language model (LLM) interface, guided by a few-shot prompting strategy, enables natural language querying without the need for formal query syntax. The system supports a range of tasks, including compatibility checks, multi-constraint filtering, and design for AM (DfAM) guidance. User natural language queries are normalized, translated into Cypher, and executed over the KG, with results reformatted into structured responses. This work presents the first real-time, interactive system that integrates a domain-specific metal AM KG with an LLM interface, offering accessible, explainable decision support for engineers and advancing human-centric tools in manufacturing intelligence.

摘要

金属增材制造(AM)涉及工艺、材料、原料和后处理步骤之间复杂的相互依赖关系。然而,现有文献和静态数据库中这些底层关联与领域知识仍呈碎片化分布,且通常需要专家级查询,限制了其在设计与规划中的应用。为填补这一空白,我们在Neo4j中构建了一个新颖的可查询知识图谱(KG),编码了涵盖7类材料家族、9种AM工艺、4类原料类型及相关后处理要求的53种金属与合金。通过采用少样本提示策略指导的大语言模型(LLM)接口,系统支持自然语言查询而无需正式查询语法。该体系可实现兼容性检查、多约束筛选和面向增材制造的设计(DfAM)指导等多种任务。用户自然语言查询经规范化处理后转换为Cypher语句,在KG上执行并将结果重构为结构化响应。本研究首次提出将特定领域金属AM知识图谱与LLM接口结合的实时交互系统,为工程师提供易用、可解释的决策支持,推动了制造智能化领域以人为本的工具发展。
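
The multi-constraint filtering described above ends in a Cypher query executed over the knowledge graph. The following is a minimal sketch of what that final query-building step could look like, assuming a hypothetical schema: the node labels (`Material`, `Process`, `Feedstock`, `Family`), relationship types, and property names are all invented for illustration and are not taken from the paper. The LLM translation step itself is not modeled.

```python
def build_compatibility_query(process=None, feedstock=None, family=None):
    """Build a parameterized Cypher query plus its parameter dict.

    All labels, relationship types, and property names are hypothetical.
    """
    match_clauses = ["MATCH (m:Material)"]
    where_clauses = []
    params = {}

    if process is not None:
        match_clauses.append("MATCH (m)-[:COMPATIBLE_WITH]->(p:Process)")
        where_clauses.append("p.name = $process")
        params["process"] = process
    if feedstock is not None:
        match_clauses.append("MATCH (m)-[:AVAILABLE_AS]->(f:Feedstock)")
        where_clauses.append("f.name = $feedstock")
        params["feedstock"] = feedstock
    if family is not None:
        match_clauses.append("MATCH (m)-[:BELONGS_TO]->(fam:Family)")
        where_clauses.append("fam.name = $family")
        params["family"] = family

    query = "\n".join(match_clauses)
    if where_clauses:
        query += "\nWHERE " + " AND ".join(where_clauses)
    query += "\nRETURN DISTINCT m.name AS material ORDER BY material"
    return query, params


query, params = build_compatibility_query(
    process="Laser Powder Bed Fusion", family="Titanium alloys"
)
```

Passing user-supplied values as query parameters rather than splicing them into the query string keeps the generated Cypher safe to execute even when the normalized user input contains quotes.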


ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation

Abstract

arXiv:2505.18374v1 Announce Type: cross Abstract: Command-line interfaces (CLIs) provide structured textual environments for system administration. Explorations have been performed using pre-trained language models (PLMs) to simulate these environments for safe interaction in high-risk environments. However, their use has been constrained to frozen, large parameter models like GPT. For smaller architectures to reach a similar level of believability, a rich dataset of CLI interactions is required. Existing public datasets focus on mapping natural-language tasks to commands, omitting crucial execution data such as exit codes, outputs, and environmental side effects, limiting their usability for behavioral modeling. We introduce a Shell Input-Output Environment (ShIOEnv), which casts command construction as a Markov Decision Process whose state is the partially built sequence and whose actions append arguments. After each action, ShIOEnv executes the candidate and returns its exit status, output, and progress toward a minimal-length behavioral objective. Due to the intractable nature of the combinatorial argument state-action space, we derive a context-free grammar from man pages to mask invalid arguments from being emitted. We explore random and proximal-policy optimization (PPO)-optimized sampling of unrestricted and grammar-masked action spaces to produce four exploration strategies. We observed that grammar masking and PPO significantly improve sample efficiency to produce a higher quality dataset (maximizing the number of arguments while minimizing redundancies). Policy-generated datasets of shell input-output behavior pairs are used to fine-tune CodeT5, where we observe 85% improvements in BLEU-4 when constraining the action space to grammar productions with an additional 26% improvement when applying PPO. The ShIOEnv environment and curated command behavior datasets are released for use in future research.

摘要

命令行界面(CLI)为系统管理提供了结构化的文本环境。已有研究利用预训练语言模型(PLM)模拟这些环境,以实现高风险场景下的安全交互。然而,此类应用目前仅限于GPT等大规模参数冻结模型。要使较小架构达到类似的拟真度,需要丰富的CLI交互数据集。现有公开数据集主要关注自然语言任务到命令的映射,忽略了退出码、输出和环境副作用等关键执行数据,限制了其在行为建模中的可用性。我们提出Shell输入-输出环境(ShIOEnv),将命令构建建模为马尔可夫决策过程:状态是部分构建的序列,动作为追加参数。每次动作后,ShIOEnv执行候选命令并返回其退出状态、输出及面向最小长度行为目标的进度。针对组合参数状态-动作空间的组合爆炸问题,我们从手册页推导出上下文无关文法以屏蔽无效参数。我们通过随机采样和近端策略优化(PPO)对比研究了无约束与文法屏蔽动作空间的四种探索策略。实验表明,文法屏蔽和PPO能显著提升采样效率,生成更高质量数据集(最大化参数数量同时最小化冗余)。策略生成的Shell输入-输出行为对数据集用于微调CodeT5模型,结果显示:当动作空间约束为文法产生式时BLEU-4提升85%,结合PPO后进一步提升26%。ShIOEnv环境与整理的命令行为数据集已开源供后续研究使用。
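
The state, action, and masking structure described above can be illustrated with a toy sketch. It assumes a hypothetical vocabulary and a flat set of valid flags standing in for grammar productions derived from a man page. Unlike ShIOEnv, it does not execute the candidate command or compute a reward; it only contrasts unrestricted and grammar-masked random sampling.

```python
import random

# Hypothetical argument vocabulary; "<END>" terminates command construction.
VOCABULARY = ["-a", "-l", "-h", "--color", "--bogus", "-Z9", "<END>"]
# Stand-in for productions derived from a man page (illustrative only).
VALID_FLAGS = {"-a", "-l", "-h"}


def valid_actions(state):
    """Grammar mask: valid flags not yet used, plus the terminal action."""
    used = set(state[1:])
    return [a for a in VOCABULARY
            if a == "<END>" or (a in VALID_FLAGS and a not in used)]


def step(state, action):
    """Append one argument to the partial command; return (next_state, done)."""
    if action == "<END>":
        return state, True
    return state + [action], False


def rollout(rng, masked, max_len=4):
    """Sample one command with random actions, with or without the mask."""
    state = ["ls"]
    for _ in range(max_len):
        actions = valid_actions(state) if masked else VOCABULARY
        state, done = step(state, rng.choice(actions))
        if done:
            break
    return state


def is_well_formed(command):
    """True if every argument is a valid flag and none is repeated."""
    args = command[1:]
    return all(a in VALID_FLAGS for a in args) and len(args) == len(set(args))


rng = random.Random(0)
masked_commands = [rollout(rng, masked=True) for _ in range(200)]
unmasked_commands = [rollout(rng, masked=False) for _ in range(200)]
```

Every masked rollout is well formed by construction, while unmasked rollouts spend part of their samples on invalid or redundant arguments, which is the sample-efficiency argument the abstract makes for masking.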


Let's Get You Hired: A Job Seeker's Perspective on Multi-Agent Recruitment Systems for Explaining Hiring Decisions

Abstract

arXiv:2505.20312v1 Announce Type: cross Abstract: During job recruitment, traditional applicant selection methods often lack transparency. Candidates are rarely given sufficient justifications for recruiting decisions, whether they are made manually by human recruiters or through the use of black-box Applicant Tracking Systems (ATS). To address this problem, our work introduces a multi-agent AI system that uses Large Language Models (LLMs) to guide job seekers during the recruitment process. Using an iterative user-centric design approach, we first conducted a two-phased exploratory study with four active job seekers to inform the design and development of the system. Subsequently, we conducted an in-depth, qualitative user study with 20 active job seekers through individual one-to-one interviews to evaluate the developed prototype. The results of our evaluation demonstrate that participants perceived our multi-agent recruitment system as significantly more actionable, trustworthy, and fair compared to traditional methods. Our study further helped us uncover in-depth insights into factors contributing to these perceived user experiences. Drawing from these insights, we offer broader design implications for building user-aligned, multi-agent explainable AI systems across diverse domains.

摘要

在招聘过程中,传统的求职者筛选方法往往缺乏透明度。无论是通过人工招聘还是使用黑箱求职者追踪系统(ATS),候选人都很少能获得充分的录用决策理由。为解决这一问题,本研究引入了一个基于大语言模型(LLMs)的多智能体人工智能系统,用于在招聘过程中为求职者提供指导。通过迭代式的以用户为中心设计方法,我们首先对四名活跃求职者进行了两阶段探索性研究,为系统设计和开发提供依据。随后,我们通过一对一深度访谈的形式,对20名活跃求职者开展了定性用户研究以评估开发的原型系统。评估结果表明,与传统方法相比,参与者认为我们的多智能体招聘系统具有显著更高的可操作性、可信度和公平性。研究还帮助我们深入揭示了影响这些用户体验感知的关键因素。基于这些发现,我们提出了更广泛的设计启示,为跨领域构建用户导向的多智能体可解释人工智能系统提供了参考。


Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL

Abstract

arXiv:2505.20315v1 Announce Type: cross Abstract: Translating natural language into SQL (Text2SQL) is a longstanding challenge at the intersection of natural language understanding and structured data access. While large language models (LLMs) have significantly improved fluency in SQL generation, producing correct and executable SQL--particularly for complex queries--remains a bottleneck. We present Arctic-Text2SQL-R1, a reinforcement learning (RL) framework and model family designed to generate accurate, executable SQL using a lightweight reward signal based solely on execution correctness. Our approach avoids brittle intermediate supervision and complex reward shaping, promoting stable training and alignment with the end task. Combined with carefully curated data, strong supervised initialization, and effective training practices, Arctic-Text2SQL-R1 achieves state-of-the-art execution accuracy across six diverse Text2SQL benchmarks, including the top position on the BIRD leaderboard. Notably, our 7B model outperforms prior 70B-class systems, highlighting the framework's scalability and efficiency. We further demonstrate inference-time robustness through simple extensions like value retrieval and majority voting. Extensive experiments and ablation studies offer both positive and negative insights, providing practical guidance for future Text2SQL research.

摘要

将自然语言转换为SQL(Text2SQL)是自然语言理解与结构化数据访问交叉领域的一项长期挑战。尽管大型语言模型(LLM)显著提升了SQL生成的流畅性,但生成正确且可执行的SQL——尤其是复杂查询——仍是瓶颈。我们提出Arctic-Text2SQL-R1,这是一个基于强化学习(RL)的框架和模型系列,通过仅依赖执行正确性的轻量级奖励信号来生成准确、可执行的SQL。该方法避免了脆弱的中间监督和复杂的奖励塑造,促进稳定训练并与最终任务对齐。结合精心构建的数据、强监督初始化和高效训练实践,Arctic-Text2SQL-R1在六项不同的Text2SQL基准测试中实现了最先进的执行准确率,包括BIRD排行榜首位。值得注意的是,我们的70亿参数模型性能超越此前700亿级系统,凸显了该框架的可扩展性和效率。通过值检索和多数投票等简单扩展,我们进一步验证了推理阶段的鲁棒性。大量实验与消融研究提供了正反两方面的洞见,为未来Text2SQL研究提供了实践指导。
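
A reward based solely on execution correctness can be sketched with Python's built-in `sqlite3` module. This is an illustrative assumption about how such a signal might be computed, not the paper's implementation: the binary 0/1 values, the order-insensitive row comparison, and the demo table are all hypothetical choices.

```python
import sqlite3
from collections import Counter


def execution_reward(conn, predicted_sql, gold_sql):
    """Return 1.0 if the predicted query runs and yields the same multiset
    of rows as the gold query, else 0.0 (row order is ignored)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # non-executable SQL earns no reward
    gold_rows = conn.execute(gold_sql).fetchall()
    return 1.0 if Counter(predicted_rows) == Counter(gold_rows) else 0.0


def make_demo_db():
    """Create a small in-memory database for the example (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("Ada", 120), ("Bo", 90), ("Cy", 150)],
    )
    return conn


conn = make_demo_db()
gold = "SELECT name FROM employees WHERE salary > 100"
reward_equivalent = execution_reward(
    conn, "SELECT name FROM employees WHERE salary >= 120", gold)
reward_wrong = execution_reward(conn, "SELECT name FROM employees", gold)
reward_invalid = execution_reward(conn, "SELEC name FROM employees", gold)
```

The first candidate differs textually from the gold query but returns the same rows, so it is rewarded; this is what distinguishes an execution-based signal from string matching.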


Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms

Abstract

arXiv:2505.20322v1 Announce Type: cross Abstract: Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating atomic knowledge components. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.

摘要

对语言模型生成过程的精确控制对确保安全性和可靠性至关重要。尽管提示工程和导向技术常被用于干预模型行为,但模型中庞大的参数量往往导致内部表征高度交织。这种相互依赖性会限制控制精度,有时甚至引发意外副作用。近期研究探索利用稀疏自编码器(SAE)在高维空间中解耦知识以实现导向,但由于定位原子知识组件的非平凡性问题,这些应用仅限于玩具任务。本文提出导向目标原子(STA)这一新方法,通过隔离和操纵解耦的知识组件来增强安全性。综合实验验证了该方法的有效性。进一步分析表明,导向技术展现出卓越的鲁棒性和灵活性,尤其在对抗场景中。我们还将该导向策略应用于大型推理模型,证实其在精确推理控制中的有效性。


Less Context, Same Performance: A RAG Framework for Resource-Efficient LLM-Based Clinical NLP

Abstract

arXiv:2505.20320v1 Announce Type: cross Abstract: Long text classification is challenging for Large Language Models (LLMs) due to token limits and high computational costs. This study explores whether a Retrieval Augmented Generation (RAG) approach using only the most relevant text segments can match the performance of processing entire clinical notes with large context LLMs. We begin by splitting clinical documents into smaller chunks, converting them into vector embeddings, and storing these in a FAISS index. We then retrieve the top 4,000 words most pertinent to the classification query and feed these consolidated segments into an LLM. We evaluated three LLMs (GPT4o, LLaMA, and Mistral) on a surgical complication identification task. Metrics such as AUC ROC, precision, recall, and F1 showed no statistically significant differences between the RAG based approach and whole-text processing (p > 0.05). These findings indicate that RAG can significantly reduce token usage without sacrificing classification accuracy, providing a scalable and cost effective solution for analyzing lengthy clinical documents.

摘要

长文本分类对大型语言模型(LLMs)而言存在挑战,主要受限于标记长度和高计算成本。本研究探讨了基于检索增强生成(RAG)的方法——仅使用最相关文本片段——是否能够达到大型上下文LLMs处理完整临床笔记的性能表现。我们首先将临床文档分割为较小片段,将其转化为向量嵌入并存储于FAISS索引中。随后检索与分类查询最相关的4,000个单词,并将这些整合后的片段输入LLM。我们在手术并发症识别任务上评估了三种LLM(GPT4o、LLaMA和Mistral)。AUC ROC、精确率、召回率和F1值等指标显示,基于RAG的方法与全文处理方式无统计学显著差异(p > 0.05)。这些结果表明,RAG能在不牺牲分类准确性的前提下显著减少标记使用量,为分析冗长临床文档提供了可扩展且经济高效的解决方案。
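
The chunk, embed, retrieve pipeline above can be sketched in a few lines. To stay self-contained, this sketch swaps the paper's vector embeddings and FAISS index for a bag-of-words cosine similarity computed in plain Python; the chunk size, the greedy budget-filling rule, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def split_into_chunks(text, chunk_words=200):
    """Split a document into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words])
            for i in range(0, len(words), chunk_words)]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(text, query, word_budget=4000, chunk_words=200):
    """Return the most query-relevant chunks, in document order,
    without exceeding word_budget words in total."""
    chunks = split_into_chunks(text, chunk_words)
    query_vec = Counter(query.lower().split())
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(Counter(chunks[i].lower().split()), query_vec),
        reverse=True,
    )
    selected, used = [], 0
    for i in ranked:
        size = len(chunks[i].split())
        if used + size > word_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(i)
        used += size
    return [chunks[i] for i in sorted(selected)]


note = ("patient stable " * 300
        + "postoperative wound infection noted "
        + "routine follow up " * 300)
context = retrieve(note, "wound infection", word_budget=100, chunk_words=50)
```

The retrieved chunks would then be concatenated into the LLM prompt in place of the full note.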


Evaluating the Energy-Efficiency of the Code Generated by LLMs

Abstract

arXiv:2505.20324v1 Announce Type: cross Abstract: As the quality of code generated by Large Language Models (LLMs) improves, their adoption in the software industry for automated code generation continues to grow. Researchers primarily focus on enhancing the functional correctness of the generated code while commonly overlooking its energy efficiency and environmental impact. This paper investigates the energy efficiency of the code generated by 20 popular LLMs for 878 programming problems of varying difficulty levels and diverse algorithmic categories selected from the LeetCode platform by comparing them against canonical human-written solutions. Although LLMs can produce functionally correct results in most cases, our findings show that the performance and energy efficiency of LLM-produced solutions are often far below those of human-written solutions. Among the studied LLMs, DeepSeek-v3 and GPT-4o generate the most energy-efficient code, whereas Grok-2 and Gemini-1.5-Pro are among the least energy-efficient models. On average, human-generated canonical solutions are approximately 1.17 times more energy efficient than DeepSeek-v3, 1.21 times more energy efficient than GPT-4o, and over 2 times more energy efficient than Grok-2 and Gemini-1.5-Pro. For specific algorithmic groups such as dynamic programming, backtracking, and bit manipulation, LLM-generated code can consume up to 450 times more energy than human-generated canonical solutions.

摘要

随着大型语言模型(LLM)生成代码质量的提升,其在软件行业自动化代码生成中的应用持续扩大。研究者主要关注提升生成代码的功能正确性,而普遍忽视了其能源效率与环境影响。本文通过比较20个主流LLM针对LeetCode平台选出的878个不同难度等级及多样算法类别编程问题生成的代码与标准人工编写解决方案,系统研究了LLM生成代码的能源效率。尽管LLM在多数情况下能生成功能正确的结果,但研究发现LLM生成解决方案的性能和能源效率往往远低于人工编写的解决方案。在研究的LLM中,DeepSeek-v3和GPT-4o生成的代码能源效率最高,而Grok-2和Gemini-1.5-Pro位列能效最低的模型。平均而言,人工生成的标准解决方案能效比DeepSeek-v3高约1.17倍,比GPT-4o高1.21倍,比Grok-2和Gemini-1.5-Pro高出2倍以上。对于动态规划、回溯和位运算等特定算法类别,LLM生成的代码能耗可达人工标准解决方案的450倍。


Multi-Scale Manifold Alignment: A Unified Framework for Enhanced Explainability of Large Language Models

Abstract

arXiv:2505.20333v1 Announce Type: cross Abstract: Recent advances in Large Language Models (LLMs) have achieved strong performance, yet their internal reasoning remains opaque, limiting interpretability and trust in critical applications. We propose a novel Multi-Scale Manifold Alignment framework that decomposes the latent space into global, intermediate, and local semantic manifolds capturing themes, context, and word-level details. Our method introduces cross-scale mapping functions that jointly enforce geometric alignment (e.g., Procrustes analysis) and information preservation (via mutual information constraints like MINE or VIB). We further incorporate curvature regularization and hyperparameter tuning for stable optimization. Theoretical analysis shows that alignment error, measured by KL divergence, can be bounded under mild assumptions. This framework offers a unified explanation of how LLMs structure multi-scale semantics, advancing interpretability and enabling applications such as bias detection and robustness enhancement.

摘要

尽管大语言模型(LLMs)近期取得显著进展,但其内部推理机制仍不透明,这限制了关键应用中模型的可解释性与可信度。我们提出一种新颖的多尺度流形对齐框架,将潜在空间分解为全局、中间和局部语义流形,分别捕获主题、上下文和词汇级细节。该方法通过跨尺度映射函数联合实施几何对齐(如Procrustes分析)与信息保留(通过MINE或VIB等互信息约束),并引入曲率正则化和超参数调优以实现稳定优化。理论分析表明,在温和假设下,以KL散度度量的对齐误差存在上界。该框架为LLMs如何组织多尺度语义提供了统一解释,推动了可解释性研究,并支持偏见检测和鲁棒性增强等应用。


Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

Abstract

arXiv:2505.20325v1 Announce Type: cross Abstract: Test-Time Scaling (TTS) methods for enhancing Large Language Model (LLM) reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.

摘要

提升大语言模型(LLM)推理能力的测试时缩放(TTS)方法通常需要高昂计算成本,主要源于对外部过程奖励模型(PRM)或'最佳N采样'(BoN)等方法的过度依赖。本文提出'直觉引导'(GG)框架,这是一种高效的自引导TTS方法,无需昂贵的外部验证模型即可达到PRM级别的性能。该方法仅通过LLM内部信号(词元级置信度和步骤新颖性)驱动轻量级树搜索实现。关键创新在于通过定向强化学习微调阶段提升内部置信度估计的可靠性。在复杂数学推理基准测试中,GG使小规模模型(如15亿参数)达到或超越超大模型(如320-700亿参数)的准确率,同时降低GPU内存使用达10倍。与基于PRM的方法相比,GG在保持相当准确性的前提下,推理速度提升8倍,内存占用减少4-5倍。此外,相较于BoN策略,GG将KV缓存内存使用降低约50%,为TTS技术提供了更高效实用的部署方案。


Dynamic Manifold Evolution Theory: Modeling and Stability Analysis of Latent Representations in Large Language Models

Abstract

arXiv:2505.20340v1 Announce Type: cross Abstract: We introduce Dynamic Manifold Evolution Theory (DMET), a unified framework that models large language model generation as a controlled dynamical system evolving on a low-dimensional semantic manifold. By casting latent-state updates as discrete-time Euler approximations of continuous dynamics, we map intrinsic energy-driven flows and context-dependent forces onto Transformer components (residual connections, attention, feed-forward networks). Leveraging Lyapunov stability theory, we define three empirical metrics (state continuity, clustering quality, topological persistence) that quantitatively link latent-trajectory properties to text fluency, grammaticality, and semantic coherence. Extensive experiments across decoding parameters validate DMET's predictions and yield principled guidelines for balancing creativity and consistency in text generation.

摘要

我们提出动态流形演化理论(DMET),这是一个将大语言模型生成过程建模为低维语义流形上受控动力系统的统一框架。通过将隐状态更新表述为连续动力学系统的离散时间欧拉近似,我们将能量驱动的内在流与上下文相关力映射到Transformer组件(残差连接、注意力机制、前馈网络)。基于李雅普诺夫稳定性理论,我们定义了三个实证指标(状态连续性、聚类质量、拓扑持久性),定量揭示了隐轨迹特性与文本流畅度、语法规范性及语义连贯性之间的关联。跨解码参数的大规模实验验证了DMET的理论预测,并为平衡文本生成中的创造性与一致性提供了原则性指导。


Beyond Demonstrations: Dynamic Vector Construction from Latent Representations

Abstract

arXiv:2505.20318v1 Announce Type: cross Abstract: In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experimental results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.

摘要

上下文推导向量(ICV)方法从大型语言模型(LLMs)中提取任务相关表征并在推理阶段重新注入,无需重复演示处理即可实现与少量样本上下文学习(ICL)相当的性能。然而现有ICV方法仍对ICL特定因素敏感,通常使用粗糙或语义碎片化的表征作为向量来源,且依赖基于启发式的注入位置,限制了其适用性。为解决这些问题,我们提出动态向量(DyVec),其采用穷尽查询旋转(EQR)策略通过减少ICL引入的方差来提取鲁棒的语义聚合潜在表征。随后应用动态潜在分割与注入技术,根据任务复杂度自适应划分表征,并利用基于REINFORCE的优化方法学习每个片段的最佳注入位置。实验结果表明DyVec优于少量样本ICL、LoRA及现有ICV基线方法。进一步分析凸显了动态分割与注入语义聚合潜在表征的有效性。DyVec为推理时任务适应提供了轻量级且数据高效的解决方案。


Language Model Distillation: A Temporal Difference Imitation Learning Perspective

Abstract

arXiv:2505.20335v1 Announce Type: cross Abstract: Large language models have led to significant progress across many NLP tasks, although their massive sizes often incur substantial computational costs. Distillation has become a common practice to compress these large and highly capable models into smaller, more efficient ones. Many existing language model distillation methods can be viewed as behavior cloning from the perspective of imitation learning or inverse reinforcement learning. This viewpoint has inspired subsequent studies that leverage (inverse) reinforcement learning techniques, including variations of behavior cloning and temporal difference learning methods. Rather than proposing yet another specific temporal difference method, we introduce a general framework for temporal difference-based distillation by exploiting the distributional sparsity of the teacher model. Specifically, it is often observed that language models assign most probability mass to a small subset of tokens. Motivated by this observation, we design a temporal difference learning framework that operates on a reduced action space (a subset of vocabulary), and demonstrate how practical algorithms can be derived and the resulting performance improvements.

摘要

大型语言模型已在众多自然语言处理任务中取得显著进展,但其庞大规模往往带来高昂计算成本。知识蒸馏作为压缩这些高性能大模型为更高效小模型的常见方法,现有许多语言模型蒸馏技术可从模仿学习或逆强化学习视角视为行为克隆。这一观点启发了后续研究采用(逆)强化学习技术,包括行为克隆变体与时序差分学习方法。不同于提出另一种具体时序差分方法,我们通过利用教师模型的分布稀疏性,提出了基于时序差分的通用蒸馏框架。具体而言,语言模型通常将大部分概率质量集中于少量词汇的现象启发了我们:通过设计在缩减动作空间(词汇子集)上操作的时序差分学习框架,不仅展示了实用算法的推导过程,同时验证了由此带来的性能提升。
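
The reduced action space mentioned above relies on the teacher placing most of its probability mass on a few tokens. One simple way to pick such a subset is to keep the highest-probability tokens until a target mass is covered. That selection rule, the threshold value, and the toy distribution below are assumptions for illustration; the abstract does not say how the subset is chosen.

```python
def reduced_action_space(token_probs, mass=0.95):
    """Return the smallest list of tokens, in descending probability,
    whose teacher probabilities sum to at least `mass`."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= mass:
            break
    return kept


def renormalize(token_probs, kept):
    """Restrict the teacher distribution to the kept tokens and rescale to sum to 1."""
    z = sum(token_probs[t] for t in kept)
    return {t: token_probs[t] / z for t in kept}


# Hypothetical next-token distribution from a teacher model.
teacher = {"the": 0.6, "a": 0.3, "an": 0.06, "this": 0.03, "zebra": 0.01}
actions = reduced_action_space(teacher, mass=0.85)
restricted = renormalize(teacher, actions)
```

Here two of five tokens already cover the requested mass, so a temporal-difference update would only need to consider those two actions at this step.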


MOSLIM: Align with diverse preferences in prompts through reward classification

Abstract

arXiv:2505.20336v1 Announce Type: cross Abstract: The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them and then optimizes the policy model with a scalar reward derived from a mapping function that converts classification results from the reward model into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.

摘要

大型语言模型(LLMs)的多目标对齐对于确保基础模型符合多样化人类偏好至关重要。当前该领域研究通常需要针对不同偏好定制多个策略或奖励模型,或需训练特定偏好的监督微调(SFT)模型。本研究提出一种新颖的多目标对齐方法MOSLIM,该方法仅需单一奖励模型和策略模型即可应对多样化目标。MOSLIM通过提示机制灵活控制这些目标,且无需在SFT阶段进行偏好训练,使得数千个现成模型可直接应用于该训练框架。该方法采用多头奖励模型对问答对进行分类而非评分,随后通过映射函数将分类结果转化为奖励分值,基于此标量奖励优化策略模型。我们在多个多目标基准测试中验证了所提方法的有效性,并对不同规模的奖励模型及策略优化方法进行了消融研究。结果表明,MOSLIM方法在多数测试中优于现有多目标方案,且相比现有策略优化方法可显著减少GPU计算资源消耗。


Do LLMs have a Gender (Entropy) Bias?

Abstract

arXiv:2505.20343v1 Announce Type: cross Abstract: We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across four key domains in business and health contexts: education, jobs, personal financial management, and general health. We study entropy bias, which we define as a discrepancy in the amount of information generated by an LLM in response to real questions users have asked. We tested this using four different LLMs and evaluated the generated responses both qualitatively and quantitatively by using ChatGPT-4o (as "LLM-as-judge"). Our analyses (metric-based comparisons and "LLM-as-judge" evaluation) suggest that there is no significant bias in LLM responses for men and women at a category level. However, at a finer granularity (the individual question level), there are substantial differences in LLM responses for men and women in the majority of cases, which often "cancel" each other out because some responses are better for males and others for females. This is still a concern since typical users of these tools often ask a specific question (only) as opposed to several varied ones in each of these common yet important areas of life. We suggest a simple debiasing approach that iteratively merges the responses for the two genders to produce a final result. Our approach demonstrates that a simple, prompt-based debiasing strategy can effectively debias LLM outputs, thus producing responses with higher information content than both gendered variants in 78% of the cases, and consistently achieving a balanced integration in the remaining cases.

摘要

我们研究了一些流行大语言模型(LLM)中特定类型性别偏见的存在与持续性,并贡献了一个新的基准数据集RealWorldQuestioning(发布于HuggingFace平台)。该数据集源自商业和健康四大关键领域的真实世界问题:教育、就业、个人财务管理和总体健康。我们定义并研究了熵偏差,即LLM针对用户真实提问所生成信息量的性别差异。通过测试四个不同LLM,我们采用ChatGPT-4o作为"LLM即评判者",对生成响应进行了定性与定量评估。分析结果表明(基于指标的对比和"LLM即评判者"评估),在类别层面上LLM对男女的响应不存在显著偏差。然而在更精细的粒度上(单个问题层面),大多数情况下LLM对男女的响应存在实质性差异,这些差异常因部分回答对男性更有利而相互抵消(反之亦然)。这仍值得关注,因为这些工具的典型用户通常只提出具体问题,而非在生活这些常见却重要的领域提出多个不同问题。我们提出了一种简单的去偏方法,通过迭代合并两性响应来生成最终结果。该方法表明,基于提示的简单去偏策略能有效消除LLM输出偏差,在78%的情况下生成比两种性别变体信息量更高的响应,并在其余情况下始终实现平衡整合。


Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

Abstract

arXiv:2505.20334v1 Announce Type: cross Abstract: Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 to 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.

摘要

大语言模型(LLMs)依赖键值缓存(KV缓存)通过减少冗余计算来加速解码。然而,KV缓存的内存使用量随文本序列长度增长而显著增加,对高效部署构成挑战。现有KV缓存淘汰方法基于预填充阶段的注意力分数剪枝令牌,导致与实际推理查询存在不一致性,尤其在严格内存限制下。本文提出前瞻查询缓存(LAQ),该新型淘汰框架通过生成低成本的伪前瞻查询,以更准确地逼近真实解码阶段查询。通过将这些前瞻查询作为重要性估计的观察窗口,LAQ实现了与真实推理场景更一致、更精准的KV缓存淘汰。在LongBench和Needle-in-a-Haystack基准测试上的实验结果表明,LAQ在不同预算水平下均优于现有方法,在有限缓存预算下使LongBench指标提升1~4分。此外,LAQ与现有方法具有互补性,可灵活结合以取得进一步改进。
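
The importance-estimation step can be sketched as follows: score each cached key by the attention weight it receives from a window of queries, then keep the top positions within the budget. The sketch takes the lookahead queries as given inputs, since the abstract does not detail how they are generated; summing softmax weights across queries and the single-head, pure-Python layout are simplifying assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_kv_positions(keys, lookahead_queries, budget):
    """Return the sorted indices of the `budget` cache positions that receive
    the most attention from the lookahead queries."""
    dim = len(keys[0])
    importance = [0.0] * len(keys)
    for query in lookahead_queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        for position, weight in enumerate(softmax(scores)):
            importance[position] += weight
    top = sorted(range(len(keys)),
                 key=lambda i: importance[i], reverse=True)[:budget]
    return sorted(top)


# Toy 2-dimensional keys and two hypothetical lookahead queries.
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 5.0]]
queries = [[1.0, 0.0], [2.0, 0.0]]
kept_positions = select_kv_positions(keys, queries, budget=2)
```

Entries outside `kept_positions` would be evicted from the cache. Replacing `queries` with prefilling-stage queries gives the baseline behavior the abstract contrasts against.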


Rethinking Text-based Protein Understanding: Retrieval or LLM?

Abstract

arXiv:2505.20354v1 Announce Type: cross Abstract: In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.

摘要

近年来,蛋白质-文本模型因其在蛋白质生成与理解领域的潜力而备受关注。现有方法主要通过持续预训练和多模态对齐,将蛋白质相关知识整合至大语言模型中,使其能够同时理解文本描述与蛋白质序列。通过对现有模型架构和基于文本的蛋白质理解基准进行全面分析,我们发现当前基准测试中存在显著的数据泄露问题。此外,源自自然语言处理的传统评估指标无法准确衡量模型在该领域的性能。针对这些局限性,我们重组现有数据集并提出基于生物实体的新型评估框架。基于研究发现的启发,我们提出一种检索增强方法,该方法在蛋白质到文本生成任务中显著优于微调后的大语言模型,并在免训练场景下展现出卓越的准确性与效率。代码与数据详见https://github.com/IDEA-XL/RAPM。


SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data

Abstract

arXiv:2505.20347v1 Announce Type: cross Abstract: Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning (SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.

摘要

近期研究表明,强化学习(RL)能有效提升大语言模型(LLMs)的推理能力。然而现有方法不可避免地依赖于高质量指令和可验证奖励进行有效训练,这两者在专业领域中往往难以获取。本文提出自对弈强化学习(SeRL),通过有限初始数据实现LLM训练的自主引导。具体而言,SeRL包含两个互补模块:自指令生成与自奖励机制。前者基于每个训练步骤的可用数据生成增量指令,并采用强健的在线过滤策略确保指令质量、多样性和难度;后者引入简单而有效的多数投票机制来评估增量指令的响应奖励,无需外部标注。最终,SeRL基于生成数据执行常规强化学习,实现迭代式自对弈训练。在多种推理基准测试和不同LLM主干模型上的大量实验表明,所提SeRL方法的性能优于同类方案,且能达到与可验证奖励的高质量数据训练相当的成果。代码已开源:https://github.com/wantbook-book/SeRL。
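
The self-rewarding module's majority-voting idea can be illustrated with a short sketch: sample several responses to the same instruction, treat the most common final answer as a pseudo-label, and reward the responses that agree with it. The binary reward values and the tie-breaking behavior (first answer seen wins) are assumptions for illustration.

```python
from collections import Counter


def majority_vote_rewards(sampled_answers):
    """Return (majority_answer, rewards), where each reward is 1.0 if the
    corresponding sampled answer matches the majority answer, else 0.0."""
    counts = Counter(sampled_answers)
    majority_answer, _ = counts.most_common(1)[0]
    rewards = [1.0 if answer == majority_answer else 0.0
               for answer in sampled_answers]
    return majority_answer, rewards


# Final answers extracted from five hypothetical sampled responses.
answers = ["42", "41", "42", "42", "7"]
pseudo_label, rewards = majority_vote_rewards(answers)
```

No external annotation is involved: the pseudo-label comes entirely from agreement among the model's own samples, which is why the estimate is only as reliable as the model's consistency on that instruction.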


Assessing the Capability of LLMs in Solving POSCOMP Questions

Abstract

arXiv:2505.20338v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination used for graduate admissions in computer science promoted by the Brazilian Computer Society (SBC), provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency in handling complex questions typical of the exam. LLM performance was notably better on text-based questions than on image interpretation tasks. In the 2022 exam, ChatGPT-4 led with 57 correct answers out of 69 questions, followed by Gemini 1.0 Advanced (49), Le Chat Mistral (48), and Claude 3 Sonnet (44). Similar trends were observed in the 2023 exam. ChatGPT-4 achieved the highest performance, surpassing all students who took the POSCOMP 2023 exam. LLMs, particularly ChatGPT-4, show promise in text-based tasks on the POSCOMP exam, although image interpretation remains a challenge. Given the rapid evolution of LLMs, we expanded our analysis to include more recent models - o1, Gemini 2.5 Pro, Claude 3.7 Sonnet, and o3-mini-high - evaluated on the 2022-2024 POSCOMP exams. These newer models demonstrate further improvements and consistently surpass both the average and top-performing human participants across all three years.

摘要

大型语言模型(LLM)的最新进展显著拓展了人工智能在自然语言处理任务中的能力。然而,其在计算机科学等专业领域的表现仍缺乏深入探究。评估LLM在这些领域的熟练度对于衡量其实用价值及指导未来发展至关重要。由巴西计算机学会(SBC)主办的POSCOMP考试作为计算机科学研究生入学权威测评,为此提供了理想基准。本研究探讨LLM能否在该考试中达到或超越人类水平。我们首先评估了ChatGPT-4、Gemini 1.0 Advanced、Claude 3 Sonnet和Le Chat Mistral Large四款模型在2022-2023年POSCOMP考试中的表现,重点考察其处理典型复杂试题的能力。结果显示:LLM在文本类题目表现显著优于图像解析任务。2022年考试中,ChatGPT-4以69题答对57题领先,Gemini 1.0 Advanced(49题)、Le Chat Mistral(48题)和Claude 3 Sonnet(44题)次之;2023年考试呈现相似趋势,ChatGPT-4更超越所有人类考生。这表明尽管存在图像解析短板,ChatGPT-4等模型在POSCOMP文本类任务中展现潜力。鉴于LLM的快速迭代,我们进一步评估了o1、Gemini 2.5 Pro、Claude 3.7 Sonnet和o3-mini-high等新模型在2022-2024年考试中的表现。这些新型号在所有三年考试中均持续超越人类考生平均及最优成绩,显示出持续的性能提升。


Risk-aware Direct Preference Optimization under Nested Risk Measure

Abstract

arXiv:2505.20359v1 Announce Type: cross Abstract: When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce KL divergence to constrain deviations between the trained model and the reference model; however, this may not be sufficient in certain applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between a trained model and the reference model using a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results across three open-source datasets (IMDb Dataset, Anthropic HH Dataset, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment performance and model drift. Our code is open-sourced at https://github.com/zlj123-max/Ra-DPO.

摘要

在对预训练大语言模型(LLMs)进行微调以使其与人类价值观和意图对齐时,最大化估计奖励虽能提升性能,但也会因偏离参考模型的预期行为而引入潜在风险。现有方法通常通过引入KL散度来约束训练模型与参考模型之间的偏差,然而在需要严格风险控制的应用场景中,这种方法可能不足。本文提出风险感知直接偏好优化(Ra-DPO),该方法通过采用一类嵌套风险度量来引入风险感知机制。该方案首先构建一个带约束的风险感知优势函数最大化问题,随后将Bradley-Terry模型转化为词元级表示。目标函数在最大化策略似然的同时,通过序列风险比率抑制训练模型与参考模型之间的偏差,从而增强模型的风险感知能力。在IMDb数据集、Anthropic HH数据集和AlpacaEval三个开源数据集上的实验结果表明,所提方法在平衡对齐性能与模型漂移方面具有优越性能。我们的代码已开源:https://github.com/zlj123-max/Ra-DPO。
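For orientation, the sketch below shows the standard, risk-neutral DPO loss that follows from the Bradley-Terry model at the sequence level. It is not the Ra-DPO objective: the abstract does not give the token-level formulation, the nested risk measure, or the sequential risk ratio, so none of them appear here. The function names and the default `beta` are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response
    under the trained policy or the frozen reference model.
    beta is an assumed temperature, not a value from the paper.
    """
    # Log-ratio of policy to reference for each response.
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    # Bradley-Terry likelihood that the chosen response is preferred.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

When the policy equals the reference model, both margins are zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.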


GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Abstract

arXiv:2505.20355v1 Announce Type: cross Abstract: Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git

摘要

低秩自适应(LoRA)是一种流行的参数高效微调(PEFT)方法,因其简单高效而备受推崇。尽管近期有所改进,LoRA仍存在一个根本性局限:当瓶颈宽度增加时会出现过拟合现象。该方法在秩为32-64时表现最佳,但在更高秩时准确率停滞或下降,仍无法达到全参数微调(FFT)的性能水平。我们发现其根本原因在于LoRA的结构性瓶颈——该结构会向无关输入通道引入梯度纠缠,并扭曲梯度传播过程。为此,我们提出了一种新颖的结构:细粒度低秩自适应(GraLoRA),该结构将权重矩阵划分为多个子块,每个子块配备独立的低秩适配器。在计算和存储成本可忽略的前提下,GraLoRA克服了LoRA的局限性,有效提升了表征能力,更逼近FFT的行为特性。在代码生成和常识推理基准测试中,GraLoRA始终优于LoRA及其他基线方法,在HumanEval+上实现了最高+8.5%的Pass@1绝对增益。这些改进在不同模型规模和秩设置下均保持稳定,使GraLoRA成为可扩展且鲁棒的PEFT解决方案。代码、数据及脚本详见https://github.com/SqueezeBits/GraLoRA.git
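The "negligible cost" claim can be checked with simple parameter counting. The sketch below assumes a `k x k` partition of an `m x n` weight matrix in which every block gets an adapter of rank `r / k`; the abstract states only that each sub-block has its own low-rank adapter, so this split is an assumption, as are the function names.

```python
def lora_param_count(m: int, n: int, r: int) -> int:
    """Parameters of one LoRA adapter: B is (m x r), A is (r x n)."""
    return r * (m + n)


def block_adapter_param_count(m: int, n: int, r: int, k: int) -> int:
    """Parameters when W is split into k x k blocks, each with rank r // k.

    Assumption (not stated in the abstract): the per-block rank is r / k.
    """
    if m % k or n % k or r % k:
        raise ValueError("m, n and r must all be divisible by k")
    block_rank = r // k
    per_block = block_rank * (m // k + n // k)
    return k * k * per_block


def max_update_rank(m: int, n: int, r: int, k: int) -> int:
    """Upper bound on the rank of the assembled block-wise update.

    Each block row concatenates k rank-(r/k) blocks (rank <= r),
    and k block rows are stacked (rank <= k * r).
    """
    return min(m, n, k * r)
```

Under this assumption, `lora_param_count(4096, 4096, 64)` and `block_adapter_param_count(4096, 4096, 64, 4)` both give 524,288 parameters, while the rank bound of the update grows from 64 to 256.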


Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents

Abstract

arXiv:2505.20368v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach performs hierarchical retrieval to reduce confusion among similar texts: it first retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process removes irrelevant passages. When necessary, it automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.

摘要

基于检索增强生成(RAG)的大语言模型(LLM)因其在知识密集型任务中的卓越表现而被广泛应用于金融领域。然而标准化文件(如SEC备案)具有相似的格式特征,包括重复的样板文本和相近的表格结构。这种相似性导致传统RAG方法容易误判近似重复文本,引发重复检索问题,进而影响结果的准确性和完整性。为解决这些问题,我们提出带证据整理的分层检索(HiREC)框架。该方法首先通过分层检索减少相似文本的混淆:先检索相关文档,再从文档中筛选最相关段落。证据整理过程会剔除无关段落,并在必要时自动生成补充查询以收集缺失信息。为评估该方法,我们构建并发布了开放域金融大规模问答基准(LOFin),包含145,897份SEC文档和1,595组问答对。代码与数据详见https://github.com/deep-over/LOFin-bench-HiREC。
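The two-stage idea (documents first, then passages, then a curation filter) can be sketched in a few lines. This is a toy illustration, not the HiREC implementation: the token-overlap scorer, the zero-score filter standing in for evidence curation, and all names are assumptions, and the complementary-query step is omitted.

```python
from collections import Counter


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of query tokens that also occur in text."""
    q = Counter(query.lower().split())
    d = Counter(text.lower().split())
    return sum(min(q[t], d[t]) for t in q)


def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=2):
    """corpus maps a document id to its list of passages."""
    # Stage 1: rank whole documents, so near-identical passages from
    # unrelated filings never reach the passage stage.
    doc_scores = {
        doc_id: overlap_score(query, " ".join(passages))
        for doc_id, passages in corpus.items()
    }
    ranked = sorted(doc_scores, key=lambda d: (-doc_scores[d], d))[:top_docs]

    # Stage 2: rank passages inside the selected documents only.
    candidates = [
        (overlap_score(query, p), doc_id, p)
        for doc_id in ranked
        for p in corpus[doc_id]
    ]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    # Curation stand-in: drop passages with no overlap at all.
    return [(doc_id, p) for score, doc_id, p in candidates[:top_passages] if score > 0]
```

With two filings that share an identical boilerplate passage, restricting stage 2 to the top-ranked document prevents the duplicate from the other filing from being returned.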


LEGO-Compiler: Enhancing Neural Compilation Through Translation Composability

Abstract

arXiv:2505.20356v1 Announce Type: cross Abstract: Large language models (LLMs) have the potential to revolutionize how we design and implement compilers and code translation tools. However, existing LLMs struggle to handle long and complex programs. We introduce LEGO-Compiler, a novel neural compilation system that leverages LLMs to translate high-level languages into assembly code. Our approach centers on three key innovations: LEGO translation, which decomposes the input program into manageable blocks; a verifiable LLM workflow, which breaks the complex compilation process into smaller, simpler steps that are checked by external tests; and a feedback mechanism for self-correction. Supported by formal proofs of translation composability, LEGO-Compiler demonstrates high accuracy on multiple datasets, including over 99% on ExeBench and 97.9% on industrial-grade AnsiBench. Additionally, LEGO-Compiler achieves a nearly one-order-of-magnitude improvement in the scalability of compilable code size. This work opens new avenues for applying LLMs to system-level tasks, complementing traditional compiler technologies.

摘要

大语言模型(LLMs)有望彻底改变我们设计和实现编译器及代码翻译工具的方式。然而,现有LLMs难以处理长而复杂的程序。我们提出LEGO-Compiler,一种新型神经编译系统,利用LLMs将高级语言翻译为汇编代码。该方法围绕三大核心创新:LEGO翻译(将输入程序分解为可管理的代码块)、通过外部测试将复杂编译过程组织为可验证的LLM工作流(从而将其拆分为更小更简单的可验证步骤),以及用于自我修正的反馈机制。在翻译组合性形式化证明的支持下,LEGO-Compiler在多个数据集上展现出高准确率(ExeBench超过99%,工业级AnsiBench达97.9%),并实现了近一个数量级的可编译代码规模扩展性提升。这项工作为LLMs应用于系统级任务开辟了新途径,对传统编译器技术形成了有力补充。


What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Abstract

arXiv:2505.20405v1 Announce Type: cross Abstract: Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

摘要

基于指令的图像编辑模型为生成任务提供了更高的个性化可能性。然而,如何准确评估其效果仍具挑战性,现有的大多数指标在与人判断一致性和可解释性方面存在不足。为解决这些问题,我们提出了DICE(差异一致性评估器),该模型旨在检测原始图像与编辑图像之间的局部差异,并评估这些差异与修改请求的相关性。DICE包含两个关键组件:差异检测器和一致性评估器,二者均基于自回归多模态大语言模型(MLLM)构建,并通过结合自监督学习、修复网络蒸馏和全监督学习的策略进行训练。通过大量实验,我们评估了流程的每个阶段,并在所提框架内比较了不同的MLLM。实验表明,DICE能有效识别连贯的编辑结果,对不同编辑模型生成的图像进行评估时,与人判断具有高度相关性。我们公开了源代码、模型及数据。


In-context Language Learning for Endangered Languages in Speech Recognition

Abstract

arXiv:2505.20445v1 Announce Type: cross Abstract: With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.

摘要

全球现存约7000种语言,而当前大型语言模型(LLMs)仅支持其中一小部分。已有研究表明,LLMs无需监督数据即可针对某些任务学习新语言。本研究将这一探索延伸至语音识别领域,重点考察LLMs是否能够通过上下文学习(ICL)掌握未见过的低资源语言。我们在四种LLMs未训练过的濒危语言上开展实验,发现提供更多相关文本样本能有效提升语言建模和自动语音识别(ASR)任务的性能。此外,实验证明基于概率的方法在语言学习方面优于传统基于指令的方法。最后,我们发现ICL能使LLMs达到与专用语言模型相当的ASR性能,甚至在某些情况下实现超越,同时完整保留LLMs的原有能力。


Holes in Latent Space: Topological Signatures Under Adversarial Influence

Abstract

arXiv:2505.20435v1 Announce Type: cross Abstract: Understanding how adversarial conditions affect language models requires techniques that capture both global structure and local detail within high-dimensional activation spaces. We propose persistent homology (PH), a tool from topological data analysis, to systematically characterize multiscale latent space dynamics in LLMs under two distinct attack modes -- backdoor fine-tuning and indirect prompt injection. By analyzing six state-of-the-art LLMs, we show that adversarial conditions consistently compress latent topologies, reducing structural diversity at smaller scales while amplifying dominant features at coarser ones. These topological signatures are statistically robust across layers, architectures, and model sizes, and they align with the emergence of adversarial effects deeper in the network. To capture finer-grained mechanisms underlying these shifts, we introduce a neuron-level PH framework that quantifies how information flows and transforms within and across layers. Together, our findings demonstrate that PH offers a principled and unifying approach to interpreting representational dynamics in LLMs, particularly under distributional shift.

摘要

理解对抗性条件如何影响语言模型需要能够捕捉高维激活空间中全局结构和局部细节的技术。我们提出持续同调(PH)这一拓扑数据分析工具,用于系统表征两种不同攻击模式下(后门微调与间接提示注入)大语言模型的多尺度潜在空间动态。通过分析六个最先进的大语言模型,我们发现对抗性条件会一致地压缩潜在拓扑结构,在较小尺度上减少结构多样性,同时在较粗尺度上放大主导特征。这些拓扑特征在模型各层、架构、规模间均具有统计稳健性,并与网络深层对抗效应的出现相吻合。为捕捉这些转变背后更精细的机制,我们提出神经元级别的PH框架,量化信息在层内及层间的流动与转换过程。综合而言,我们的研究证明PH为解释大语言模型表征动态(尤其在分布偏移条件下)提供了一种原则性、统一性的分析方法。


GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

Abstract

arXiv:2505.20416v1 Announce Type: cross Abstract: Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.

摘要

大型语言模型(LLM)的微调通常需要大量高质量监督数据,这些数据的获取成本高昂且耗费人力。尽管合成数据生成已成为一种有前景的解决方案,但现有方法常存在事实错误、长尾覆盖不足、知识结构过于简化以及输出同质化等问题。为解决这些挑战,我们提出了GraphGen——一个基于知识图谱引导的框架,专为三类关键问答(QA)场景设计:原子问答、聚合问答和多跳问答。该框架首先从源文本构建细粒度知识图谱,随后通过预期校准误差指标识别LLM的知识盲区,优先生成针对高价值长尾知识的问答对。此外,GraphGen采用多跳邻域采样以捕捉复杂关系信息,并利用风格控制生成技术实现问答数据的多样化。在闭卷设置下的知识密集型任务实验中,GraphGen的表现优于传统合成数据方法,为监督微调中的数据稀缺问题提供了更可靠、更全面的解决方案。代码与数据已公开于https://github.com/open-sciencelab/GraphGen。
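Expected calibration error (ECE), the metric named above for locating knowledge gaps, has a standard binned form: group predictions by confidence and average the gap between confidence and accuracy, weighted by bin size. The sketch below implements that generic definition; how GraphGen maps the score onto individual knowledge-graph units is not described in the abstract, and the bin count is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|.

    confidences: model confidence in [0, 1] for each answer.
    correct:     whether each answer was actually right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four statements and gets three right has an ECE of 0.15 on that set; a larger gap suggests the model's confidence about that knowledge is less trustworthy.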


SEMMA: A Semantic Aware Knowledge Graph Foundation Model

Abstract

arXiv:2505.20422v1 Announce Type: cross Abstract: Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.

摘要

知识图谱基础模型(KGFMs)通过学习可迁移模式,在未见图谱的零样本推理任务中展现出潜力。然而现有大多数KGFMs仅依赖图结构,忽视了文本属性中编码的丰富语义信号。本文提出SEMMA——一种双模块KGFM框架,系统性地整合可迁移文本语义与图结构。该模型利用大语言模型(LLMs)增强关系标识符,生成语义嵌入并构建文本关系图,进而与结构组件进行融合。在54个多样化知识图谱上的实验表明,SEMMA在全归纳链路预测任务中优于ULTRA等纯结构基线模型。关键发现是:在测试阶段关系词汇完全未知的更具挑战性的泛化场景中,结构方法完全失效,而SEMMA的效果是其2倍。本研究证实当结构信息失效时,文本语义对泛化能力具有决定性作用,凸显了知识推理领域需要统一结构与语言信号的基础模型。


Robot Operation of Home Appliances by Reading User Manuals

Abstract

arXiv:2505.20424v1 Announce Type: cross Abstract: Operating home appliances, among the most common tools in every household, is a critical capability for assistive home robots. This paper presents ApBot, a robot system that operates novel household appliances by "reading" their user manuals. ApBot faces multiple challenges: (i) infer goal-conditioned partial policies from their unstructured, textual descriptions in a user manual document, (ii) ground the policies to the appliance in the physical world, and (iii) execute the policies reliably over potentially many steps, despite compounding errors. To tackle these challenges, ApBot constructs a structured, symbolic model of an appliance from its manual, with the help of a large vision-language model (VLM). It grounds the symbolic actions visually to control panel elements. Finally, ApBot closes the loop by updating the model based on visual feedback. Our experiments show that across a wide range of simulated and real-world appliances, ApBot achieves consistent and statistically significant improvements in task success rate, compared with state-of-the-art large VLMs used directly as control policies. These results suggest that structured internal representations play an important role in robust robot operation of home appliances, especially complex ones.

摘要

操作家用电器是辅助家庭机器人最关键的技能之一,这些电器是每个家庭中最常见的工具。本文提出ApBot系统,该机器人通过"阅读"用户手册来操作新型家用电器。ApBot面临多重挑战:(i) 从用户手册的非结构化文本描述中推断目标导向的部分策略,(ii) 将这些策略在物理世界中与电器建立关联,(iii) 尽管存在累积误差,仍能可靠地执行可能包含多个步骤的策略。为解决这些挑战,ApBot在大型视觉语言模型(VLM)的协助下,从手册构建电器的结构化符号模型,并通过视觉将符号动作与控制面板元素进行关联。最后,ApBot通过基于视觉反馈更新模型来实现闭环控制。实验表明,在多种模拟和真实电器上,与直接作为控制策略的最先进大型VLM相比,ApBot在任务成功率上实现了具有统计学意义的显著提升。这些结果表明,结构化内部表征对于机器人稳健操作家用电器(尤其是复杂电器)具有重要作用。


Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models

Abstract

arXiv:2505.20569v1 Announce Type: cross Abstract: Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.

摘要

尽管大规模视觉语言模型取得了显著进展,物体幻觉(OH)仍然是持续存在的挑战。基于先前关于对比解码的研究(该方法无需额外模型训练即可解决此问题),我们提出RVCD(检索视觉对比解码)这一抑制OH的先进方法。RVCD在logit层面同时利用负样本和正样本图像,明确参考旨在表示单一概念的AI生成图像。我们的方法相较现有基于解码的方案展现出显著改进。


InFact: Informativeness Alignment for Improved LLM Factuality

Abstract

arXiv:2505.20487v1 Announce Type: cross Abstract: Factual completeness is a general term that captures how detailed and informative a factually correct text is. For instance, the sentence "Barack Obama was born in the United States" is factually correct, though less informative than the sentence "Barack Obama was born in Honolulu, Hawaii, United States". Although LLMs are known to hallucinate and generate factually incorrect text, they may also tend to generate text that is factually correct yet less informative than other available choices. In this work, we tackle this problem by proposing an informativeness alignment mechanism. This mechanism takes advantage of recent factual benchmarks to propose an informativeness alignment objective. This objective prioritizes answers that are both correct and informative. A key finding of our work is that when training a model to maximize this objective or optimize its preference, we can improve not just informativeness but also factuality.

摘要

事实完整性是一个概括性术语,用于描述事实正确的文本在细节和信息量上的丰富程度。例如,事实性陈述"巴拉克·奥巴马出生于美国"虽然正确,但信息量不及"巴拉克·奥巴马出生于美国夏威夷檀香山"这一事实陈述。尽管已知大型语言模型容易产生幻觉并生成事实错误的文本,但它们也可能倾向于选择生成那些事实正确但信息量不足的文本。本研究通过提出信息量对齐机制来解决这一问题。该机制利用最新的事实性基准测试,构建了信息量对齐目标函数,该目标优先选择既正确又信息丰富的答案。我们的关键发现表明:当训练模型以最大化该目标或优化其偏好时,不仅能提升信息量,还能改善事实准确性。


Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

Abstract

arXiv:2505.20500v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.

摘要

大型语言模型(LLMs)在简历筛选和内容审核等决策任务中的应用日益广泛,使其具备放大或压制特定观点的能力。尽管先前研究已发现LLMs存在与残疾相关的偏见,但关于其如何概念化健全主义或识别文本中的此类内容仍知之甚少。本研究评估了四种LLMs识别针对自闭症个体的微妙健全主义的能力,通过考察模型对相关术语的理解与语境中识别健全主义内容的效果之间的差距。结果表明,LLMs能够识别自闭症相关语言,但常忽略其中有害或冒犯的隐含意义。此外,我们对人类与LLMs的解释进行了定性比较,发现LLMs倾向于依赖表层关键词匹配,导致语境误读;而人类标注者则会综合考虑上下文、说话者身份及潜在影响。另一方面,LLMs与人类在标注方案上表现一致,表明二元分类足以评估LLMs性能,这与先前涉及人类标注者的研究结论相符。


Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

Abstract

arXiv:2505.20503v1 Announce Type: cross Abstract: Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action Models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interactions, robots can better understand, adapt to, and execute complex tasks in dynamic real-world environments. However, embodied AI in mobile service robots continues to face key challenges, including multimodal sensor fusion, real-time decision-making under uncertainty, task generalization, and effective human-robot interactions (HRI). In this paper, we present the first systematic review of the integration of foundation models in mobile service robotics, identifying key open challenges in embodied AI and examining how foundation models can address them. Namely, we explore the role of such models in enabling real-time sensor fusion, language-conditioned control, and adaptive task execution. Furthermore, we discuss real-world applications in the domestic assistance, healthcare, and service automation sectors, demonstrating the transformative impact of foundation models on service robotics. We also include potential future research directions, emphasizing the need for predictive scaling laws, autonomous long-term adaptation, and cross-embodiment generalization to enable scalable, efficient, and robust deployment of foundation models in human-centric robotic systems.

摘要

基础模型(包括大语言模型、视觉语言模型、多模态大语言模型和视觉语言动作模型)的快速发展为移动服务机器人中的具身人工智能开辟了新途径。通过将基础模型与具身AI原理(即智能系统通过物理交互实现感知、推理与行动)相结合,机器人能够增强对动态现实环境的理解、适应及复杂任务执行能力。然而,移动服务机器人的具身AI仍面临多模态传感器融合、不确定性下的实时决策、任务泛化以及有效人机交互(HRI)等关键挑战。本文首次系统综述了基础模型在移动服务机器人中的集成应用,指出具身AI领域的核心开放挑战,并探讨基础模型应对这些挑战的路径。具体而言,我们分析了此类模型在实现实时传感器融合、语言条件控制及自适应任务执行中的作用。此外,我们研究了家庭辅助、医疗保健和服务自动化领域的实际应用案例,揭示基础模型对服务机器人技术的变革性影响。最后提出未来研究方向,强调需要建立预测性扩展定律、实现自主长期适应以及跨具身泛化能力,以推动基础模型在以人为中心的机器人系统中实现可扩展、高效且稳健的部署。


Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning

Abstract

arXiv:2505.20561v1 Announce Type: cross Abstract: Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history contexts only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why it is beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at https://github.com/shenao-zhang/BARL.

摘要

基于强化学习(RL)训练的大语言模型(LLMs)已展现出强大的推理能力和涌现的反思行为,如回溯与错误修正。然而,传统马尔可夫强化学习将探索限制在训练阶段以学习最优确定性策略,且仅通过当前状态依赖历史上下文。因此,反思性推理是否会在马尔可夫强化学习训练过程中涌现,或其为何在测试阶段具有优势,仍不明确。为解决这一问题,我们在贝叶斯自适应强化学习框架中重新构建了反思性探索,该框架显式优化了马尔可夫决策过程后验分布下的期望回报。这种贝叶斯公式通过信念更新,内在激励了奖励最大化的利用与信息收集的探索。我们提出的BARL算法指导大语言模型根据观测结果拼接和切换策略,为模型何时及如何进行反思性探索提供了原则性指导。在合成任务与数学推理任务上的实验结果表明,BARL在测试阶段优于标准马尔可夫强化学习方法,通过提升探索效率实现了更优的标记效率。代码已发布于https://github.com/shenao-zhang/BARL。


Collision- and Reachability-Aware Multi-Robot Control with Grounded LLM Planners

Abstract

arXiv:2505.20573v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated strong performance in various robot control tasks. However, their deployment in real-world applications remains constrained. Even state-of-the-art LLMs, such as GPT-o4mini, frequently produce invalid action plans that violate physical constraints, such as directing a robot to an unreachable location or causing collisions between robots. This issue primarily arises from a lack of awareness of these physical constraints during the reasoning process. To address this issue, we propose a novel framework that uses reinforcement learning with verifiable rewards (RLVR) to instill knowledge of physical constraints in LLMs and induce constraint-aware reasoning during plan generation. In this approach, only valid action plans that successfully complete a control task receive positive rewards. We applied our method to two small-scale LLMs: a non-reasoning Qwen2.5-3B-Instruct and a reasoning Qwen3-4B. Experimental results on both the BoxNet task and a newly developed BoxNet3D environment built using MuJoCo demonstrate that constraint-aware small LLMs largely outperform large-scale models without constraints. This work highlights the effectiveness of grounding even small LLMs with physical constraints to enable scalable and efficient multi-robot control in complex, physically constrained environments.

摘要

大型语言模型(LLMs)在各种机器人控制任务中展现出强大性能,但其在实际应用中的部署仍受限制。即便是GPT-o4mini等最先进的LLMs,也常生成违反物理约束的无效动作计划,例如指挥机器人到达不可达位置或引发机器人碰撞。该问题主要源于推理过程中缺乏对这些物理约束的认知。为解决此问题,我们提出一种融合强化学习与可验证奖励(RLVR)的新框架,通过将物理约束知识编码至LLMs,引导其在计划生成时进行约束感知推理。该方法仅对成功完成控制任务的有效动作计划给予正向奖励。我们将该方法应用于两个小规模LLMs:非推理型Qwen2.5-3B-Instruct与推理型Qwen3-4B。实验结果表明,在基于BoxNet任务和MuJoCo构建的新开发BoxNet3D环境中,具备约束感知能力的小型LLMs显著优于无约束的大规模模型。这项工作证明,通过对小型LLMs进行物理约束落地(grounding),可在复杂物理约束环境中实现可扩展且高效的多机器人控制。


REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Abstract

arXiv:2505.20613v1 Announce Type: cross Abstract: Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthetic data collection. In our experiments, our prover using only supervised fine-tuning achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset, comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

摘要

如今,形式化定理证明器在中学及竞赛级数学领域已取得显著进展,但鲜有系统能推广至高等数学领域。本文提出REAL-Prover——一个基于Lean 4的新型开源逐步定理证明器,旨在突破这一界限。该证明器基于我们微调的大型语言模型(REAL-Prover-v1)并与检索系统(Leansearch-PS)集成,显著提升了大学阶段数学问题的求解能力。为训练REAL-Prover-v1,我们开发了HERALD-AF数据提取流程(可将自然语言数学问题转化为形式化陈述)及新型开源Lean 4交互环境(Jixia-interactive)以促进合成数据收集。实验表明,仅采用监督微调的证明器在ProofNet数据集上达到23.7%的成功率(Pass@64),与最先进(SOTA)模型相当。为进一步评估方法,我们提出聚焦代数问题的新基准FATE-M,在此测试中我们的证明器以56.7%的成功率(Pass@64)创下SOTA记录。
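The Pass@64 figures above are pass@k scores. A common way to compute pass@k from `n` sampled attempts, of which `c` succeed, is the unbiased combinatorial estimator sketched below. Whether REAL-Prover uses this estimator or simply counts a theorem as solved when any of 64 attempts verifies is not stated in the abstract; for `n == k` the two coincide.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts containing c successes, succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of `pass_at_k` over all problems.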


Effectiveness of Prompt Optimization in NL2SQL Systems

Abstract

arXiv:2505.20591v1 Announce Type: cross Abstract: NL2SQL approaches have greatly benefited from the impressive capabilities of large language models (LLMs). In particular, bootstrapping an NL2SQL system for a specific domain can be as simple as instructing an LLM with sufficient contextual information, such as schema details and translation demonstrations. However, building an accurate system still requires the rigorous task of selecting the right context for each query, including identifying relevant schema elements, cell values, and suitable exemplars that help the LLM understand domain-specific nuances. Retrieval-based methods have become the go-to approach for identifying such context. While effective, these methods introduce additional inference-time costs due to the retrieval process. In this paper, we argue that production scenarios demand high-precision, high-performance NL2SQL systems, rather than simply high-quality SQL generation, which is the focus of most current NL2SQL approaches. In such scenarios, the careful selection of a static set of exemplars (capturing the intricacies of the query log, target database, SQL constructs, and execution latencies) plays a more crucial role than exemplar selection based solely on similarity. The key challenge, however, lies in identifying a representative set of exemplars for a given production setting. To this end, we propose a prompt optimization framework that not only addresses the high-precision requirement but also optimizes the performance of the generated SQL through multi-objective optimization. Preliminary empirical analysis demonstrates the effectiveness of the proposed framework.

摘要

NL2SQL方法从大语言模型(LLM)的强大能力中获益良多。特别是,为特定领域快速构建NL2SQL系统可以像为LLM提供足够的上下文信息(如模式细节和翻译示例)那样简单。然而,构建一个准确的系统仍然需要为每个查询精心选择正确的上下文——包括识别相关的模式元素、单元格值以及有助于LLM理解领域特定细微差别的合适示例。基于检索的方法已成为识别此类上下文的首选方法。虽然有效,但这些方法由于检索过程而引入了额外的推理时间成本。在本文中,我们认为生产场景需要高精度、高性能的NL2SQL系统,而不仅仅是高质量的SQL生成(这是当前大多数NL2SQL方法的重点)。在此类场景中,精心选择一组静态示例(捕捉查询日志、目标数据库、SQL结构和执行延迟的复杂性)比仅基于相似性选择示例起着更关键的作用。然而,关键挑战在于为给定的生产环境确定一组具有代表性的示例。为此,我们提出了一个提示优化框架,不仅满足高精度要求,还通过多目标优化来优化生成SQL的性能。初步实证分析证明了所提出框架的有效性。


Can Past Experience Accelerate LLM Reasoning?

Abstract

arXiv:2505.20643v1 Announce Type: cross Abstract: Allocating more compute to large language model (LLM) reasoning has generally been demonstrated to improve effectiveness, but it also results in increased inference time. In contrast, humans can perform tasks faster and better with increased experience and exposure. Hence, this paper aims to investigate the question: Can LLMs also become faster at reasoning through recurrent exposure to relevant tasks, and if so, how can it be achieved? To address these questions, we first formalize the problem setting of LLM reasoning speedup systematically in the dimensions of task relevancy and compute budget calculation. We then propose SpeedupLLM, a theoretically guaranteed framework to implement and benchmark such reasoning speedup behaviour based on adaptive compute allocation and memory mechanisms. We further conduct comprehensive experiments to benchmark such behaviour across different question similarity levels, memory methods, and reasoning methods. Results show that LLMs can generally reason faster with past experience, achieving up to a 56% reduction in compute cost when equipped with appropriate memory and reasoning methods.

摘要

增加大型语言模型(LLM)推理的计算资源通常被证明能提升其效能,但也会导致推理时间延长。相比之下,人类通过经验积累和重复接触可以更快更好地完成任务。因此,本文旨在探究:LLM是否也能通过相关任务的重复接触加快推理速度?若可行,如何实现?针对这些问题,我们首先从任务相关性和计算预算分配维度系统化地形式化了LLM推理加速的问题设定。随后提出SpeedupLLM框架,该框架基于自适应计算分配与记忆机制,以理论保证的方式实现并评估此类推理加速行为。我们进一步通过全面实验,在不同问题相似度、记忆方法和推理方法下对比了该行为的性能表现。结果表明,LLM在具备过往经验时普遍能实现更快推理,当配备合适的记忆与推理方法时,计算成本最高可降低56%。


SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation

Abstract

arXiv:2505.20622v1 Announce Type: cross Abstract: We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En to Zh and Zh to En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En to Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.

摘要

我们提出了一种用于同声传译机器翻译(SiMT)的序列策略优化框架(SeqPO-SiMT),该框架将同声传译任务定义为序列决策问题,并通过定制化奖励机制在提升翻译质量的同时降低延迟。与普遍应用于单步任务的人类反馈强化学习(RLHF)方法(如PPO和DPO)不同,SeqPO-SiMT能有效处理多步同声传译任务。这一直观框架使得大语言模型(LLM)能够利用定制化奖励模拟并优化同声传译过程。我们在六个跨领域数据集上进行了英汉/汉英同声传译实验,结果表明SeqPO-SiMT始终能以更低延迟实现显著更优的翻译质量。具体而言,在NEWSTEST2021英译汉数据集中,SeqPO-SiMT的COMET指标较监督微调(SFT)模型提升1.13分,同时将平均延迟(Average Lagging)降低6.17。值得注意的是,尽管同声传译可利用的上下文远少于离线翻译,但基于70亿参数大模型的SeqPO-SiMT系统,其翻译质量竟能媲美包括Qwen-2.5-7B-Instruct和LLaMA-3-8B-Instruct在内的高性能大模型的离线翻译效果。
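The latency figure quoted above is reported in Average Lagging (AL). As a minimal sketch, assuming the definition commonly used in the simultaneous-translation literature (the abstract does not say which variant the authors use), AL can be computed per sentence as follows. The function name and argument layout are illustrative only.

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL) for a single sentence.

    g[t-1] is the number of source tokens read before target token t is emitted.
    AL averages g(t) - (t - 1) / gamma up to the first step that has read the
    whole source, where gamma = tgt_len / src_len.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    total = 0.0
    for t, read in enumerate(g, start=1):
        total += read - (t - 1) / gamma
        if read >= src_len:  # cut-off step: the full source has been read
            return total / t
    # Fallback if the policy never reads the whole source
    return total / len(g)
```

Under this definition a policy that stays one token behind the speaker scores 1, while a policy that waits for the whole sentence scores the source length, so lower values mean lower latency.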


Test-Time Learning for Large Language Models

Abstract

arXiv:2505.20633v1 Announce Type: cross Abstract: While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.

摘要

尽管大型语言模型(LLMs)通过大规模预训练展现出卓越的涌现能力,但在泛化至专业领域及处理多样化语言变异(即分布偏移)方面仍存在关键局限。本文提出一种面向LLMs的测试时学习(TTL)范式——TLM,该范式仅利用测试阶段的未标注数据动态适配目标领域。具体而言,我们首先通过实证与理论分析揭示:最小化未标注测试数据的输入困惑度可实现LLMs更精准的预测。基于此发现,我们将LLMs的测试时学习过程形式化为输入困惑度最小化问题,从而实现模型性能的自监督增强。进一步地,我们发现高困惑度样本通常蕴含更丰富的模型优化信息,据此提出样本高效学习策略,主动筛选并侧重这些样本进行测试时更新。最后,为缓解灾难性遗忘并确保适配稳定性,采用低秩适配(LoRA)替代全参数优化,在实现轻量级模型更新的同时保留更多原始知识。我们构建了AdaptEval基准用于TTL评估,实验表明TLM在领域知识适配任务上较原始LLMs性能提升至少20%。
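The two quantities the abstract relies on, input perplexity and the selection of high-perplexity samples, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: per-token log-probabilities are assumed to be already available from the model, and the fixed `threshold` rule and the `logprobs` field name are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the input tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_high_perplexity(samples, threshold):
    """Keep test samples whose input perplexity exceeds `threshold`.

    `samples` is a list of dicts with a "logprobs" entry (hypothetical layout).
    The fixed threshold is an assumption made for this sketch.
    """
    return [s for s in samples if perplexity(s["logprobs"]) > threshold]
```

In a test-time learning loop, the selected samples would then drive the LoRA updates that minimize input perplexity; that training step is omitted here.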


FinTagging: An LLM-ready Benchmark for Extracting and Structuring Financial Information

Abstract

arXiv:2505.20650v1 Announce Type: cross Abstract: We introduce FinTagging, the first full-scope, table-aware XBRL benchmark designed to evaluate the structured information extraction and semantic alignment capabilities of large language models (LLMs) in the context of XBRL-based financial reporting. Unlike prior benchmarks that oversimplify XBRL tagging as flat multi-class classification and focus solely on narrative text, FinTagging decomposes the XBRL tagging problem into two subtasks: FinNI for financial entity extraction and FinCL for taxonomy-driven concept alignment. It requires models to jointly extract facts and align them with the full 10k+ US-GAAP taxonomy across both unstructured text and structured tables, enabling realistic, fine-grained evaluation. We assess a diverse set of LLMs under zero-shot settings, systematically analyzing their performance on both subtasks and overall tagging accuracy. Our results reveal that, while LLMs demonstrate strong generalization in information extraction, they struggle with fine-grained concept alignment, particularly in disambiguating closely related taxonomy entries. These findings highlight the limitations of existing LLMs in fully automating XBRL tagging and underscore the need for improved semantic reasoning and schema-aware modeling to meet the demands of accurate financial disclosure. Code is available at our GitHub repository and data is at our Hugging Face repository.

摘要

我们推出首个全范围、表格感知的XBRL基准测试FinTagging,旨在评估大语言模型(LLM)在基于XBRL的财务报告中的结构化信息提取与语义对齐能力。与先前将XBRL标记简化为扁平多类分类且仅关注叙述性文本的基准不同,FinTagging将XBRL标记问题分解为两个子任务:用于金融实体提取的FinNI和用于分类法驱动概念对齐的FinCL。该基准要求模型联合提取事实并将其与完整的10,000+美国通用会计准则分类法进行对齐,涵盖非结构化文本和结构化表格,从而实现真实、细粒度的评估。我们在零样本设置下评估了多种LLM,系统分析了它们在子任务和整体标记准确率上的表现。结果表明,虽然LLM在信息提取方面展现出强大的泛化能力,但在细粒度概念对齐上存在困难,特别是在区分密切相关的分类法条目时。这些发现揭示了现有LLM在完全自动化XBRL标记方面的局限性,并强调需要改进语义推理和模式感知建模以满足精准财务披露的需求。代码发布于GitHub仓库,数据存放于Hugging Face仓库。


Self-Route: Automatic Mode Switching via Capability Estimation for Efficient Reasoning

Abstract

arXiv:2505.20664v1 Announce Type: cross Abstract: While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.

摘要

虽然推理增强型大语言模型(RLLMs)通过延长推理链显著提升了复杂任务的表现,但它们不可避免地会带来大量不必要的token消耗——尤其在短思维链(Short CoT)即可解决的简单问题上。这种"过度思考"现象导致资源使用效率低下,却未带来相应的准确率提升。为此,我们提出Self-Route动态推理框架,该框架基于模型能力评估自动选择通用模式或推理模式。我们的方法引入轻量级预推理阶段,从隐藏层表征中提取能力感知嵌入向量,从而实时评估模型解决问题的能力。我们进一步构建Gradient-10K数据集,该数据集基于模型难度估计并采用密集复杂度采样,用于训练路由器实现精确的能力边界检测。大量实验表明,Self-Route在保持与推理模型相当准确率的同时,能在多样化基准测试中减少30-55%的token消耗。该框架在不同参数规模和推理范式的模型上均展现出稳定的有效性,体现了其普适性和实用价值。


Accelerating RL for LLM Reasoning with Optimal Advantage Regression

Abstract

arXiv:2505.20686v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a powerful tool for fine-tuning large language models (LLMs) to improve complex reasoning abilities. However, state-of-the-art policy optimization methods often suffer from high computational overhead and memory consumption, primarily due to the need for multiple generations per prompt and the reliance on critic networks or advantage estimates of the current policy. In this paper, we propose A*-PO, a novel two-stage policy optimization framework that directly approximates the optimal advantage function and enables efficient training of LLMs for reasoning tasks. In the first stage, we leverage offline sampling from a reference policy to estimate the optimal value function V*, eliminating the need for costly online value estimation. In the second stage, we perform on-policy updates using a simple least-squares regression loss with only a single generation per prompt. Theoretically, we establish performance guarantees and prove that the KL-regularized RL objective can be optimized without requiring complex exploration strategies. Empirically, A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks, while reducing training time by up to 2× and peak memory usage by over 30% compared to PPO, GRPO, and REBEL. Implementation of A*-PO can be found at https://github.com/ZhaolinGao/A-PO.

摘要

强化学习(RL)已成为微调大语言模型(LLMs)以提升复杂推理能力的重要工具。然而,现有最先进的策略优化方法通常存在计算开销大、内存消耗高的问题,这主要源于每个提示需要多次生成样本以及对当前策略的评论网络或优势估计的依赖。本文提出A*-PO,一种新颖的两阶段策略优化框架,通过直接逼近最优优势函数实现LLMs在推理任务中的高效训练。第一阶段利用参考策略的离线采样估计最优价值函数V*,避免了昂贵的在线价值估计;第二阶段采用简单的最小二乘回归损失进行同策略更新,每个提示仅需单次生成。理论层面,我们建立了性能保证,证明KL正则化强化学习目标无需复杂探索策略即可实现优化。实验表明,相较于PPO、GRPO和REBEL方法,A*-PO在广泛数学推理基准测试中保持竞争力,同时将训练时间最多缩短2倍,峰值内存占用降低超过30%。A*-PO的实现详见https://github.com/ZhaolinGao/A-PO。
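The two stages can be illustrated with the standard closed form of the KL-regularized objective, where the optimal value is beta * log E[exp(r / beta)] under the reference policy and the optimal policy satisfies beta * log(pi / pi_ref) = r - V. The sketch below assumes these textbook identities; whether the paper's loss matches this form term for term is an assumption, and the function names are hypothetical.

```python
import math


def estimate_optimal_value(rewards, beta):
    """Stage 1 (sketch): estimate the optimal value for one prompt from the
    rewards of N responses sampled offline from the reference policy, using
    beta * log(mean(exp(r / beta))).
    """
    m = max(r / beta for r in rewards)  # subtract the max for numerical stability
    mean_exp = sum(math.exp(r / beta - m) for r in rewards) / len(rewards)
    return beta * (m + math.log(mean_exp))


def regression_loss(logp_policy, logp_ref, reward, v_star, beta):
    """Stage 2 (sketch): least-squares loss for a single generated response.

    Regresses the implicit advantage beta * log(pi / pi_ref) onto the
    estimated optimal advantage, reward - v_star.
    """
    implicit_advantage = beta * (logp_policy - logp_ref)
    return (implicit_advantage - (reward - v_star)) ** 2
```

Because the value estimate is computed once offline, the on-policy stage needs only one generation per prompt and no critic network, which is the source of the savings the abstract describes.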


What LLMs Miss in Recommendations: Bridging the Gap with Retrieval-Augmented Collaborative Signals

Abstract

arXiv:2505.20730v1 Announce Type: cross Abstract: User-item interactions contain rich collaborative signals that form the backbone of many successful recommender systems. While recent work has explored the use of large language models (LLMs) for recommendation, it remains unclear whether LLMs can effectively reason over this type of collaborative information. In this paper, we conduct a systematic comparison between LLMs and classical matrix factorization (MF) models to assess LLMs' ability to leverage user-item interaction data. We further introduce a simple retrieval-augmented generation (RAG) method that enhances LLMs by grounding their predictions in structured interaction data. Our experiments reveal that current LLMs often fall short in capturing collaborative patterns inherent to MF models, but that our RAG-based approach substantially improves recommendation quality, highlighting a promising direction for future LLM-based recommenders.

摘要

用户-物品交互数据蕴含丰富的协同信号,这些信号构成众多成功推荐系统的核心基础。尽管近期研究探索了大型语言模型(LLMs)在推荐系统中的应用,但LLMs能否有效推理此类协同信息仍不明确。本文通过系统比较LLMs与经典矩阵分解(MF)模型,评估LLMs利用用户-物品交互数据的能力。我们进一步提出一种简单的检索增强生成(RAG)方法,通过将LLMs的预测基于结构化交互数据来增强其性能。实验表明,当前LLMs在捕捉MF模型固有的协同模式方面存在不足,但基于RAG的方法能显著提升推荐质量——这为未来基于LLM的推荐系统指明了有前景的发展方向。


Pretraining Language Models to Ponder in Continuous Space

Abstract

arXiv:2505.20674v1 Announce Type: cross Abstract: Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be seamlessly integrated with various existing language models. Experiments across three widely used open-source architectures (GPT-2, Pythia, and LLaMA) and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.

摘要

人类在表达复杂句子成分前会进行深思,通过集中精力实现更深层次的认知处理。本研究将这种深思过程引入语言模型,通过在单个标记生成步骤中重复调用前向过程来实现。在深思阶段,模型并非从预测分布中采样实际标记,而是根据预测的标记分布生成所有标记嵌入的加权和作为深思结果。生成的嵌入随后作为输入反馈给下一次前向传递。研究表明,通过自监督学习,模型能够学会这种深思方式,且无需任何人工标注。该方法简单直接,可与多种现有语言模型无缝集成。基于三种广泛使用的开源架构(GPT-2、Pythia和LLaMA)的实验及大量下游任务评估验证了方法的有效性和普适性。在语言建模任务中,深思语言模型的性能可与参数数量翻倍的原始模型相媲美。在9个下游基准测试中,经深思增强的Pythia模型显著优于官方Pythia模型。值得注意的是,深思增强的Pythia-1B模型性能接近TinyLlama-1.1B,而后者训练数据量是其10倍。代码已开源:https://github.com/LUMIA-Group/PonderingLM。
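The pondering step described above, a probability-weighted sum of all token embeddings that is fed back for another forward pass, can be sketched in plain Python. This is a toy illustration: `forward` stands in for a real language model, and replacing the input with the pondering embedding (rather than, say, adding it to the original input) is an assumption, since the abstract does not specify the feedback rule.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ponder_embedding(logits, embedding_table):
    """Weighted sum of all token embeddings under the predicted distribution."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


def ponder(forward, x, embedding_table, steps):
    """Run `steps` pondering passes, then one final pass that returns logits.

    `forward` maps an input embedding to vocabulary logits (toy stand-in).
    """
    for _ in range(steps):
        x = ponder_embedding(forward(x), embedding_table)
    return forward(x)
```

Since no discrete token is sampled during pondering, every step stays differentiable, which is consistent with the abstract's claim that the behaviour can be learned by self-supervised training alone.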


Dissecting Physics Reasoning in Small Language Models: A Multi-Dimensional Analysis from an Educational Perspective

Abstract

arXiv:2505.20707v1 Announce Type: cross Abstract: Small Language Models (SLMs) offer computational efficiency and accessibility, making them promising for educational applications. However, their capacity for complex reasoning, particularly in domains such as physics, remains underexplored. This study investigates the high school physics reasoning capabilities of state-of-the-art SLMs (under 4 billion parameters), including instruct versions of Llama 3.2, Phi 4 Mini, Gemma 3, and Qwen series. We developed a comprehensive physics dataset from the OpenStax High School Physics textbook, annotated according to Bloom's Taxonomy, with LaTeX and plaintext mathematical notations. A novel cultural contextualization approach was applied to a subset, creating culturally adapted problems for Asian, African, and South American/Australian contexts while preserving core physics principles. Using an LLM-as-a-judge framework with Google's Gemini 2.5 Flash, we evaluated answer and reasoning chain correctness, along with calculation accuracy. The results reveal significant differences between the SLMs. Qwen 3 1.7B achieved high 'answer accuracy' (85%), but 'fully correct reasoning' was substantially low (38%). The format of the mathematical notation had a negligible impact on performance. SLMs exhibited varied performance across the physics topics and showed a decline in reasoning quality with increasing cognitive and knowledge complexity. In particular, the consistency of reasoning was largely maintained in diverse cultural contexts, especially by better performing models. These findings indicate that, while SLMs can often find correct answers, their underlying reasoning is frequently flawed, suggesting an overreliance on pattern recognition. For SLMs to become reliable educational tools in physics, future development must prioritize enhancing genuine understanding and the generation of sound, verifiable reasoning chains over mere answer accuracy.

摘要

小语言模型(SLMs)凭借其计算高效性和易用性,在教育应用领域展现出广阔前景。然而,其在复杂推理(特别是物理等领域)的能力仍未得到充分探索。本研究调查了最先进SLMs(参数小于40亿)的高中物理推理能力,包括Llama 3.2、Phi 4 Mini、Gemma 3和Qwen系列的指令微调版本。我们从OpenStax高中物理教材构建了综合性物理数据集,依据布鲁姆分类法进行标注,并包含LaTeX与纯文本数学表达式。通过创新的文化情境化方法,我们创建了针对亚洲、非洲和南美/澳大利亚背景的文化适配问题子集,同时保留核心物理原理。采用谷歌Gemini 2.5 Flash作为评判框架,我们评估了答案正确性、推理链完整性及计算准确性。结果显示各SLMs存在显著差异:Qwen 3 1.7B获得较高"答案准确率"(85%),但"完全正确推理"比例极低(38%)。数学表达格式对性能影响可忽略。SLMs在不同物理主题表现各异,且随认知与知识复杂度提升,推理质量明显下降。值得注意的是,在多元文化情境中,尤其是性能较优模型能较好保持推理一致性。这些发现表明,尽管SLMs常能给出正确答案,但其底层推理往往存在缺陷,暗示其过度依赖模式识别。要使SLMs成为可靠的物理教育工具,未来发展必须着重提升真实理解能力与可验证的健全推理链生成,而非仅追求答案准确性。


In Context Learning with Vision Transformers: Case Study

Abstract

arXiv:2505.20872v1 Announce Type: cross Abstract: Large transformer models have been shown to be capable of performing in-context learning. By using examples in a prompt as well as a query, they are capable of performing tasks such as few-shot, one-shot, or zero-shot learning to output the corresponding answer to this query. One area of interest to us is that these transformer models have been shown to be capable of learning the general class of certain functions, such as linear functions and small 2-layer neural networks, on random data (Garg et al., 2023). We aim to extend this to the image space to analyze their capability to in-context learn more complex functions on the image space, such as convolutional neural networks and other methods.

摘要

大型Transformer模型已被证明能够进行上下文学习。通过使用提示中的示例和查询,它们能够执行少样本、单样本或零样本学习任务,从而输出该查询的相应答案。我们关注的一个领域是,这些Transformer模型已被证明能够在随机数据上学习特定函数的通用类别,例如线性函数和小型双层神经网络(Garg等,2023)。我们的目标是将此扩展到图像空间,以分析它们在图像空间中对更复杂函数(如卷积神经网络和其他方法)进行上下文学习的能力。


SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences

Abstract

arXiv:2505.20776v1 Announce Type: cross Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models, reducing latency across all stages. To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. The code is available at https://github.com/jycha98/SpecExtend .

摘要

推测解码是一种广泛采用的加速大语言模型(LLM)推理的技术,但在长输入上其性能会因注意力成本增加和草稿准确性下降而降低。我们提出了SpecExtend,一种即插即用的增强方法,无需额外训练即可提升推测解码在长序列上的性能。SpecExtend将FlashAttention和混合树注意力等高效注意力机制集成到草稿模型和目标模型中,降低了所有阶段的延迟。为提高草稿准确性和速度,我们提出了跨模型检索,一种新颖的KV缓存更新策略,利用目标模型的注意力分数动态选择草稿模型的相关上下文。在三个长上下文理解数据集上的大量评估表明,对于长达16K标记的输入,SpecExtend将标准基于树的推测解码加速最高达2.22倍,为长序列的推测解码提供了有效解决方案。代码可在https://github.com/jycha98/SpecExtend获取。
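The idea behind Cross-model Retrieval, using the target model's attention scores to decide which context the draft model keeps, can be sketched as a top-k selection over chunks of the input. The chunking scheme, the summed-attention ranking, and all names below are assumptions made for illustration; the abstract does not describe the actual selection rule.

```python
def select_context_chunks(attn_scores, chunk_size, top_k):
    """Rank fixed-size chunks of the context by summed target-model attention
    and return the indices of the top_k chunks in their original order.
    """
    n_chunks = (len(attn_scores) + chunk_size - 1) // chunk_size
    totals = [
        sum(attn_scores[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_chunks)
    ]
    ranked = sorted(range(n_chunks), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:top_k])


def draft_cache_positions(chunks, chunk_size, seq_len):
    """Token positions the draft model's KV cache would retain."""
    return [
        pos
        for c in chunks
        for pos in range(c * chunk_size, min((c + 1) * chunk_size, seq_len))
    ]
```

Keeping only the highest-attention chunks shrinks the draft model's KV cache, which is one plausible way to obtain the draft speed and accuracy gains the abstract reports for long inputs.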


FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation

Abstract

arXiv:2505.20783v1 Announce Type: cross Abstract: Path planning is a critical component in autonomous drone operations, enabling safe and efficient navigation through complex environments. Recent advances in foundation models, particularly large language models (LLMs) and vision-language models (VLMs), have opened new opportunities for enhanced perception and intelligent decision-making in robotics. However, their practical applicability and effectiveness in global path planning remain relatively unexplored. This paper proposes foundation model-guided path planners (FM-Planner) and presents a comprehensive benchmarking study and practical validation for drone path planning. Specifically, we first systematically evaluate eight representative LLM and VLM approaches using standardized simulation scenarios. To enable effective real-time navigation, we then design an integrated LLM-Vision planner that combines semantic reasoning with visual perception. Furthermore, we deploy and validate the proposed path planner through real-world experiments under multiple configurations. Our findings provide valuable insights into the strengths, limitations, and feasibility of deploying foundation models in real-world drone applications, and provide practical implementations for autonomous flight. Project site: https://github.com/NTU-ICG/FM-Planner.

摘要

路径规划是自主无人机操作中的关键组成部分,能够实现复杂环境下的安全高效导航。基础模型(尤其是大语言模型LLMs和视觉语言模型VLMs)的最新进展,为机器人领域的增强感知与智能决策开辟了新机遇。然而这些模型在全局路径规划中的实际适用性与有效性仍缺乏充分探索。本文提出基础模型引导的路径规划器(FM-Planner),并针对无人机路径规划开展了全面的基准测试研究与实践验证。具体而言,我们首先通过标准化仿真场景系统评估了八种代表性LLM与VLM方法;为实现有效的实时导航,进而设计出融合语义推理与视觉感知的LLM-视觉集成规划器;最后通过多配置真实环境实验对所提路径规划器进行了部署验证。研究结果为基础模型在现实无人机应用中的优势、局限性和部署可行性提供了重要见解,并为自主飞行提供了实用实施方案。项目地址:https://github.com/NTU-ICG/FM-Planner。


PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Abstract

arXiv:2505.20759v1 Announce Type: cross Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.

摘要

现实世界中的物体由独特且专属于特定对象的部件构成。识别这些部件是实现细粒度组合推理的关键——然而,大型多模态模型(LMMs)在这一看似简单的任务上表现欠佳。本研究提出PARTONOMY基准测试,专为像素级部件定位设计。该基准整合现有部件数据集及我们严格标注的新图像集,共包含862个部件标签和534个物体标签用于评估。不同于仅要求模型识别通用部件的现有数据集,PARTONOMY采用专业概念(如农用飞机),要求模型比较物体部件、分析部件-整体关系,并通过视觉分割验证文本预测。实验表明,前沿LMMs存在显著局限(如LISA-13B仅获得5.9%广义交并比),凸显其部件定位能力的重大缺陷。我们发现现有支持分割的LMMs存在两大架构缺陷:使用预训练阶段未见的特殊[SEG]标记导致分布偏移,以及丢弃预测分割而非利用历史预测指导后续推理。针对这些问题,我们训练了多个部件中心化LMMs,并提出新型分割模型PLUM——采用跨度标记替代分割标记,并通过反馈循环整合历史预测。实验证明,预训练PLUM在推理分割、视觉问答和视觉幻觉基准测试中优于现有分割模型。经解释性部件分割任务微调后,PLUM与使用更丰富分割数据训练的模型性能相当。本研究为LMMs实现细粒度可验证的视觉理解开辟了新路径。


Bridging the Gap: Self-Optimized Fine-Tuning for LLM-based Recommender Systems

Abstract

arXiv:2505.20771v1 Announce Type: cross Abstract: Recent years have witnessed extensive exploration of Large Language Models (LLMs) in the field of Recommender Systems (RS). There are currently two commonly used strategies to enable LLMs to have recommendation capabilities: 1) The "Guidance-Only" strategy uses in-context learning to exploit and amplify the inherent semantic understanding and item recommendation capabilities of LLMs; 2) The "Tuning-Only" strategy uses supervised fine-tuning (SFT) to fine-tune LLMs with the aim of fitting them to real recommendation data. However, neither of these strategies can effectively bridge the gap between the knowledge space of LLMs and recommendation, and their performance does not meet our expectations. To better enable LLMs to learn recommendation knowledge, we combine the advantages of the above two strategies and propose a novel "Guidance+Tuning" method called Self-Optimized Fine-Tuning (SOFT), which adopts the idea of curriculum learning. It first employs self-distillation to construct an auxiliary easy-to-learn but meaningful dataset from a fine-tuned LLM. Then it further utilizes a self-adaptive curriculum scheduler to enable LLMs to gradually learn from simpler data (self-distilled data) to more challenging data (real RS data). Extensive experiments demonstrate that SOFT significantly enhances the recommendation accuracy (37.59% on average) of LLM-based methods. The code is available via https://anonymous.4open.science/r/Self-Optimized-Fine-Tuning-264E

摘要

近年来,大型语言模型(LLMs)在推荐系统(RS)领域的应用得到了广泛探索。目前有两种常用策略使LLMs具备推荐能力:1)"仅引导"策略利用上下文学习来开发和放大LLMs固有的语义理解与项目推荐能力;2)"仅调优"策略通过监督微调(SFT)使LLMs适配真实推荐数据。然而这两种策略均无法有效弥合LLMs知识空间与推荐任务之间的鸿沟,其性能表现未达预期。为更好地使LLMs学习推荐知识,我们结合上述策略优势提出新型"引导+调优"方法——自优化微调(SOFT),该方法采用课程学习思想。首先通过自蒸馏从微调后的LLM构建辅助性的易学习且有意义的数据集,继而利用自适应课程调度器使LLMs实现从简单数据(自蒸馏数据)到复杂数据(真实RS数据)的渐进学习。大量实验表明,SOFT显著提升了基于LLM方法的推荐准确率(平均提升37.59%)。代码详见https://anonymous.4open.science/r/Self-Optimized-Fine-Tuning-264E


EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models

Abstract

arXiv:2505.20888v1 Announce Type: cross Abstract: In this paper, we present EasyDistill, a comprehensive toolkit designed for effective black-box and white-box knowledge distillation (KD) of large language models (LLMs). Our framework offers versatile functionalities, including data synthesis, supervised fine-tuning, ranking optimization, and reinforcement learning techniques specifically tailored for KD scenarios. The toolkit accommodates KD functionalities for both System 1 (fast, intuitive) and System 2 (slow, analytical) models. With its modular design and user-friendly interface, EasyDistill empowers researchers and industry practitioners to seamlessly experiment with and implement state-of-the-art KD strategies for LLMs. In addition, EasyDistill provides a series of robust distilled models and KD-based industrial solutions developed by us, along with the corresponding open-sourced datasets, catering to a variety of use cases. Furthermore, we describe the seamless integration of EasyDistill into Alibaba Cloud's Platform for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for LLMs more accessible and impactful within the NLP community.

摘要

本文介绍EasyDistill工具包——一个专为大型语言模型(LLMs)设计的黑盒与白盒知识蒸馏(KD)综合工具。该框架提供多功能支持,包括针对KD场景特别优化的数据合成、监督微调、排序优化及强化学习技术,可同时支持系统1(快速直觉型)和系统2(缓慢分析型)模型的蒸馏功能。通过模块化设计和友好用户界面,EasyDistill助力研究者和从业者无缝实施LLMs前沿蒸馏策略。工具包还提供我们研发的系列强效蒸馏模型、基于KD的工业解决方案及配套开源数据集,覆盖多样化应用场景。此外,我们阐述了EasyDistill与阿里云机器学习平台PAI的无缝集成方案。总体而言,EasyDistill工具包显著提升了NLP领域对LLMs高级蒸馏技术的可及性与实践价值。


Abstract

arXiv:2505.20767v1 Announce Type: cross Abstract: Faithfulness hallucinations are claims generated by a Large Language Model (LLM) that are not supported by the contexts provided to the LLM. Lacking an assessment standard, existing benchmarks only contain "factual statements" that rephrase source materials, without marking "cognitive statements" that make inferences from the given context, making the consistency evaluation and optimization of cognitive statements difficult. Inspired by how evidence is assessed in the legislative domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and create a benchmark dataset where we reveal insightful statistics. We design an annotation pipeline to create larger benchmarks for different LLMs automatically, and the resulting larger-scale CogniBench-L dataset can be used to train an accurate cognitive hallucination detection model. We release our model and dataset at: https://github.com/FUTUREEEEEE/CogniBench

摘要

忠实性幻觉是指大型语言模型(LLM)生成的、无法由所提供的上下文支持的论断。由于缺乏评估标准,现有基准仅包含对源材料进行改写的"事实性陈述",而未标记从给定上下文中进行推断的"认知性陈述",这使得认知性陈述的一致性评估和优化变得困难。受立法领域证据评估方法的启发,我们设计了一个严格的框架来评估认知性陈述的不同忠实度级别,并创建了一个基准数据集,其中揭示了具有洞察力的统计数据。我们设计了一个标注流程,以自动为不同LLM创建更大规模的基准,由此产生的大规模CogniBench-L数据集可用于训练精确的认知幻觉检测模型。我们的模型和数据集发布于:https://github.com/FUTUREEEEEE/CogniBench


Generalizable Heuristic Generation Through Large Language Models with Meta-Optimization

Abstract

arXiv:2505.20881v1 Announce Type: cross Abstract: Heuristic design with large language models (LLMs) has emerged as a promising approach for tackling combinatorial optimization problems (COPs). However, existing approaches often rely on manually predefined evolutionary computation (EC) optimizers and single-task training schemes, which may constrain the exploration of diverse heuristic algorithms and hinder the generalization of the resulting heuristics. To address these issues, we propose Meta-Optimization of Heuristics (MoH), a novel framework that operates at the optimizer level, discovering effective optimizers through the principle of meta-learning. Specifically, MoH leverages LLMs to iteratively refine a meta-optimizer that autonomously constructs diverse optimizers through (self-)invocation, thereby eliminating the reliance on a predefined EC optimizer. These constructed optimizers subsequently evolve heuristics for downstream tasks, enabling broader heuristic exploration. Moreover, MoH employs a multi-task training scheme to promote its generalization capability. Experiments on classic COPs demonstrate that MoH constructs an effective and interpretable meta-optimizer, achieving state-of-the-art performance across various downstream tasks, particularly in cross-size settings.

摘要

基于大语言模型(LLMs)的启发式设计已成为解决组合优化问题(COPs)的一种有前景的方法。然而,现有方法通常依赖于手动预定义的进化计算(EC)优化器和单任务训练方案,这可能会限制对多样化启发式算法的探索,并阻碍所得启发式的泛化能力。针对这些问题,我们提出了启发式元优化(MoH),这是一个在优化器层面操作的新框架,通过元学习原则发现有效的优化器。具体而言,MoH利用LLMs迭代精炼一个元优化器,该元优化器通过(自我)调用自主构建多样化的优化器,从而消除对预定义EC优化器的依赖。这些构建的优化器随后为下游任务演化启发式,实现更广泛的启发式探索。此外,MoH采用多任务训练方案以提升其泛化能力。在经典COPs上的实验表明,MoH构建了一个有效且可解释的元优化器,在各种下游任务中实现了最先进的性能,尤其在跨规模设置中表现突出。


Respond to Change with Constancy: Instruction-tuning with LLM for Non-I.I.D. Network Traffic Classification

Abstract

arXiv:2505.20866v1 Announce Type: cross Abstract: Encrypted traffic classification is highly challenging in network security due to the need for extracting robust features from content-agnostic traffic data. Existing approaches face critical issues: (i) Distribution drift, caused by reliance on the closed-world assumption, limits adaptability to real-world, shifting patterns; (ii) Dependence on labeled data restricts applicability where such data is scarce or unavailable. Large language models (LLMs) have demonstrated remarkable potential in offering generalizable solutions across a wide range of tasks, achieving notable success in various specialized fields. However, their effectiveness in traffic analysis remains constrained by challenges in adapting to the unique requirements of the traffic domain. In this paper, we introduce a novel traffic representation model named Encrypted Traffic Out-of-Distribution Instruction Tuning with LLM (ETooL), which integrates LLMs with knowledge of traffic structures through a self-supervised instruction tuning paradigm. This framework establishes connections between textual information and traffic interactions. ETooL demonstrates more robust classification performance and superior generalization in both supervised and zero-shot traffic classification tasks. Notably, it achieves significant improvements in F1 scores: APP53 (I.I.D.) to 93.19% (6.62%) and 92.11% (4.19%), APP53 (O.O.D.) to 74.88% (18.17%) and 72.13% (15.15%), and ISCX-Botnet (O.O.D.) to 95.03% (9.16%) and 81.95% (12.08%). Additionally, we construct NETD, a traffic dataset designed to support dynamic distributional shifts, and use it to validate ETooL's effectiveness under varying distributional conditions. Furthermore, we evaluate the efficiency gains achieved through ETooL's instruction tuning approach.

摘要

加密流量分类在网络安全领域极具挑战性,其难点在于需要从内容不可知的流量数据中提取鲁棒特征。现有方法面临两个关键问题:(1) 基于封闭世界假设导致的分布漂移现象,限制了模型对现实世界中动态变化模式的适应能力;(2) 对标注数据的依赖性使其在数据稀缺或缺失场景中应用受限。大型语言模型(LLMs)已展现出为广泛任务提供通用解决方案的卓越潜力,并在多个专业领域取得显著成功。然而,其在流量分析中的有效性仍受限于对流量领域特殊需求的适配挑战。本文提出新型流量表征模型ETooL(基于LLM的加密流量分布外指令调优模型),通过自监督指令调优范式将LLMs与流量结构知识相融合。该框架建立了文本信息与流量交互之间的关联,在监督学习和零样本流量分类任务中均表现出更鲁棒的分类性能和更优异的泛化能力。具体而言,其F1分数实现显著提升:APP53(同分布)达93.19%(提升6.62%)和92.11%(提升4.19%),APP53(分布外)达74.88%(提升18.17%)和72.13%(提升15.15%),ISCX-Botnet(分布外)达95.03%(提升9.16%)和81.95%(提升12.08%)。此外,我们构建了支持动态分布漂移的流量数据集NETD,用于验证ETooL在变化分布条件下的有效性,并评估了通过指令调优方法实现的效率提升。


Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation

Abstract

arXiv:2505.20897v1 Announce Type: cross Abstract: Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics via language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.

摘要

视觉与语言导航(VLN)要求智能体在部分可观测环境下遵循自然语言指令进行导航,这使得感知与语言的对齐变得困难。现有方法通过想象未来场景来缓解这一问题,但这些方法依赖于基于视觉的合成,导致计算成本高昂且存在冗余细节。为此,我们提出通过语言形式自适应地想象关键环境语义,从而实现更可靠高效的策略。具体而言,我们引入了一种基于大语言模型(LLM)构建的双分支自引导想象策略——自适应文本梦境生成器(ATD)。该模型采用类人左右脑架构设计:左脑专注于逻辑整合,右脑负责对未来场景进行想象预测。我们仅对双脑中的Q-former进行微调,以高效激活LLM中的领域特定知识,实现导航过程中逻辑推理与想象能力的动态更新。此外,我们提出交叉交互机制来规范化想象输出,并将其注入导航专家模块,使ATD能够同时利用LLM的推理能力和导航模型的专长。在R2R基准测试上的大量实验表明,ATD以更少的参数实现了最先进的性能。代码见 https://github.com/zhangpingrui/Adaptive-Text-Dreamer。


Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties

Abstract

arXiv:2505.20875v1 Announce Type: cross Abstract: Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code (https://github.com/jiyounglee-0523/TransEnV) and datasets (https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1) are publicly available.

摘要

大型语言模型(LLMs)目前主要基于标准美国英语(SAE)进行评估,往往忽视了全球英语变体的多样性。这种狭隘的评估范围可能引发公平性问题,因为对非标准英语变体的性能下降会导致全球用户获得不平等的收益。因此,在多种非标准英语变体上全面评估LLMs的语言鲁棒性至关重要。我们提出了Trans-EnV框架,该框架能自动将SAE数据集转换为多种英语变体以评估语言鲁棒性。我们的框架结合了:(1)语言学专家知识,从语言学文献和语料库中整理特定变体的特征与转换规则;(2)基于LLM的转换方法,确保语言有效性与可扩展性。利用Trans-EnV,我们将六个基准数据集转换为38种英语变体,并评估了七种最先进的LLMs。结果显示存在显著的性能差异,非标准变体上的准确率最高下降46.3%。这些发现强调了跨多样英语变体进行综合语言鲁棒性评估的重要性。Trans-EnV的每个构建环节均通过严格的统计检验和二语习得领域研究人员的咨询验证,确保其语言学有效性。我们的代码和数据集已公开。


Multi-objective Large Language Model Alignment with Hierarchical Experts

Abstract

arXiv:2505.20925v1 Announce Type: cross Abstract: Aligning large language models (LLMs) to simultaneously satisfy multiple objectives remains a significant challenge, especially given the diverse and often conflicting nature of human preferences. Existing alignment methods struggle to balance trade-offs effectively, often requiring costly retraining or yielding suboptimal results across the Pareto frontier of preferences. In this paper, we introduce HoE (Hierarchical Mixture-of-Experts), a lightweight, parameter-efficient, and plug-and-play approach that eliminates the need for model training, while enabling LLMs to adapt across the entire Pareto frontier and accommodate diverse user preferences. In particular, HoE consists of three hierarchical components: LoRA Experts, Router Experts and Preference Routing, reaching optimal Pareto frontiers and achieving a trade-off between parameter size, training cost, and performance. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines. Code is available in the supplementary materials.

摘要

使大语言模型(LLM)同时满足多个目标的对齐问题仍然是一个重大挑战,尤其是在人类偏好具有多样性且往往相互冲突的情况下。现有对齐方法难以有效平衡这些权衡,通常需要昂贵的重新训练或在偏好帕累托前沿产生次优结果。本文提出HoE(分层专家混合),这是一种轻量级、参数高效且即插即用的方法,无需模型训练即可使LLM适应整个帕累托前沿并兼容多样化的用户偏好。具体而言,HoE包含三个分层组件:LoRA专家、路由专家和偏好路由,能够达到最优帕累托前沿,并在参数量、训练成本和性能之间实现权衡。我们在6个基准测试中对14项目标和200种不同偏好进行了多任务评估,结果表明HoE优于15种近期基线方法。代码详见补充材料。


An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

Abstract

arXiv:2505.20854v1 Announce Type: cross Abstract: Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge's potential as a scalable and reliable alternative to human evaluation.

摘要

大型语言模型(LLMs)及其他自动化技术正日益被用于支持软件开发人员生成代码片段、补丁和注释等软件制品。然而,如何准确评估这些生成制品的正确性仍是一个重大挑战。一方面,人工评估虽准确性高,但费时费力且缺乏可扩展性;另一方面,现有的自动评估指标虽具有可扩展性且人力需求低,却往往无法准确反映生成软件制品的实际正确性。

本文提出SWE-Judge,这是首个专为准确评估生成软件制品正确性而设计的LLM-as-Ensemble-Judge评估指标。SWE-Judge首先定义了五种不同的评估策略,每种策略均作为独立评判者实现。随后通过动态团队选择机制确定最合适的评判者子集,通过集成产生最终正确性评分。我们在包括CoNaLa、Card2Code、HumanEval-X、APPS、APR-Assess和Summary-Assess在内的多样化软件工程(SE)基准上评估SWE-Judge,这些基准涵盖代码生成、自动化程序修复和代码摘要三项SE任务。实验结果表明,SWE-Judge与人工评估结果的相关性持续优于现有自动指标,提升幅度达5.9%至183.8%。此外,在代码生成和程序修复任务中,SWE-Judge与人工标注者的一致性水平达到标注者间一致性相当的程度。这些发现表明SWE-Judge具备作为可扩展且可靠的人工评估替代方案的潜力。
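The ensemble idea described above (several independent judges, then a selected subset averaged into one score) can be sketched as follows. This is a minimal illustration, not the SWE-Judge implementation: the function names, the exhaustive subset search, and the selection criterion (lowest mean absolute error against a few human labels) are all assumptions made for the example.

```python
from itertools import combinations


def ensemble_score(judges, artifact):
    """Average the scores returned by the selected judges."""
    return sum(judge(artifact) for judge in judges) / len(judges)


def select_team(judges, labelled):
    """Pick the non-empty subset of judges whose averaged score best
    matches human labels on a small annotated sample.

    judges:   dict mapping a name to callable(artifact) -> float in [0, 1]
    labelled: list of (artifact, human_score) pairs
    The criterion (mean absolute error) is an assumption for this sketch.
    """
    best_names, best_err = None, float("inf")
    names = sorted(judges)
    for size in range(1, len(names) + 1):
        for team in combinations(names, size):
            members = [judges[name] for name in team]
            err = sum(
                abs(ensemble_score(members, artifact) - human)
                for artifact, human in labelled
            ) / len(labelled)
            if err < best_err:
                best_names, best_err = team, err
    return best_names
```

With five judges the exhaustive search covers only 31 subsets, so a brute-force selection of this kind stays cheap; a larger judge pool would need a greedy or heuristic search instead.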


Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models

Abstract

arXiv:2505.20921v1 Announce Type: cross Abstract: LLM providers typically offer multiple LLM tiers, varying in performance and price. As NLP tasks become more complex and modularized, selecting a suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address the problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The Starter selects the initial LLM tier expected to solve the given question, the Generator produces a response using the LLM of the selected tier, and the Judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across the top-k similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.

摘要

大型语言模型(LLM)提供商通常提供多个性能与价格各异的LLM层级。随着自然语言处理任务日益复杂化和模块化,如何为每个子任务选择合适的LLM层级成为平衡成本与性能的关键挑战。为此,我们提出无需训练的LLM自动变速器(LLM-AT)框架,该框架由启动器、生成器和评判器组成:启动器选择预期能解决给定问题的初始LLM层级,生成器使用所选层级的LLM生成响应,评判器则评估响应的有效性。若响应无效,LLM-AT将迭代升级至更高层级模型,重新生成响应并评估,直至获得有效响应。此外,我们提出准确率估计器,该组件无需训练即可实现合适的初始LLM层级选择——通过计算历史推理记录中top-k相似查询的有效响应率,预估每个LLM层级对输入问题的预期准确率。实验表明,LLM-AT在降低成本的同时实现了卓越性能,为实际应用提供了实用解决方案。
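The Starter, Generator, and Judge loop, together with the accuracy estimator, can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the tier names, the cosine-similarity retrieval, the 0.6 threshold, and the `generate` and `judge` callables are hypothetical placeholders.

```python
from math import sqrt

# Hypothetical tier names, ordered from cheapest to most capable.
TIERS = ["small", "medium", "large"]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def estimate_accuracy(query_vec, history, k=3):
    """Estimate per-tier accuracy as the valid-response rate over the
    top-k most similar past queries.

    history: list of (embedding, {tier: was_valid}) records.
    """
    nearest = sorted(
        history, key=lambda rec: cosine(query_vec, rec[0]), reverse=True
    )[:k]
    estimates = {}
    for tier in TIERS:
        outcomes = [rec[1][tier] for rec in nearest if tier in rec[1]]
        estimates[tier] = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return estimates


def pick_start_tier(estimates, threshold=0.6):
    """Starter: the cheapest tier whose estimated accuracy clears the threshold."""
    for tier in TIERS:
        if estimates[tier] >= threshold:
            return tier
    return TIERS[-1]


def answer(question, query_vec, history, generate, judge, threshold=0.6):
    """Generate with the starting tier and escalate until the judge accepts."""
    start = pick_start_tier(estimate_accuracy(query_vec, history), threshold)
    response = None
    for tier in TIERS[TIERS.index(start):]:
        response = generate(tier, question)
        if judge(question, response):
            return tier, response
    # No tier produced a response the judge accepted: return the top tier's answer.
    return TIERS[-1], response
```

Starting from an estimated tier rather than always from the cheapest one avoids paying for calls that past records suggest will fail, which is where the cost saving in this scheme would come from.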


Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement

Abstract

arXiv:2505.20973v1 Announce Type: cross Abstract: Foundation Models (FMs) have shown remarkable capabilities in various natural language tasks. However, their ability to accurately capture stakeholder requirements remains a significant challenge for using FMs for software development. This paper introduces a novel approach that leverages an FM-powered multi-agent system called AlignMind to address this issue. By having a cognitive architecture that enhances FMs with Theory-of-Mind capabilities, our approach considers the mental states and perspectives of software makers. This allows our solution to iteratively clarify the beliefs, desires, and intentions of stakeholders, translating these into a set of refined requirements and a corresponding actionable natural language workflow in the often-overlooked requirements refinement phase of software engineering, which is crucial after initial elicitation. Through a multifaceted evaluation covering 150 diverse use cases, we demonstrate that our approach can accurately capture the intents and requirements of stakeholders, articulating them as both specifications and a step-by-step plan of action. Our findings suggest that the potential for significant improvements in the software development process justifies these investments. Our work lays the groundwork for future innovation in building intent-first development environments, where software makers can seamlessly collaborate with AIs to create software that truly meets their needs.

摘要

基础模型(FMs)在各种自然语言任务中展现出卓越能力,但其准确捕捉利益相关者需求的能力仍是将其应用于软件开发的重要挑战。本文提出一种创新方法,通过名为AlignMind的FM驱动多智能体系统解决该问题。该方法采用增强FMs心理理论能力的认知架构,充分考虑软件制作者的心理状态和视角,从而能在软件工程中常被忽视的需求细化阶段(初始获取后的关键环节)迭代澄清利益相关者的信念、愿望和意图,并将其转化为精细化需求集及对应的可执行自然语言工作流。基于涵盖150个多样化用例的多维度评估,我们证明该方法能精准捕获利益相关者意图和需求,并将其明确表述为规范文档和分步执行计划。研究结果表明,这些投入将带来软件开发过程的显著改进潜力。本研究为构建"意图优先"开发环境的未来创新奠定基础,使软件制作者能与人工智能无缝协作,真正开发出符合需求的软件。


Who Reasons in the Large Language Models?

Abstract

arXiv:2505.20993v1 Announce Type: cross Abstract: Despite the impressive performance of large language models (LLMs), the process of endowing them with new capabilities--such as mathematical reasoning--remains largely empirical and opaque. A critical open question is whether reasoning abilities stem from the entire model, specific modules, or are merely artifacts of overfitting. In this work, we hypothesize that the reasoning capabilities in well-trained LLMs are primarily attributed to the output projection module (oproj) in the Transformer's multi-head self-attention (MHSA) mechanism. To support this hypothesis, we introduce Stethoscope for Networks (SfN), a suite of diagnostic tools designed to probe and analyze the internal behaviors of LLMs. Using SfN, we provide both circumstantial and empirical evidence suggesting that oproj plays a central role in enabling reasoning, whereas other modules contribute more to fluent dialogue. These findings offer a new perspective on LLM interpretability and open avenues for more targeted training strategies, potentially enabling more efficient and specialized LLMs.

摘要

尽管大型语言模型(LLMs)展现出卓越的性能,但赋予其新能力(如数学推理)的过程仍主要依赖经验且缺乏透明度。一个关键悬而未决的问题是:推理能力究竟源于整个模型、特定模块,还是仅仅是过拟合的产物。本研究提出假设:在训练良好的LLMs中,推理能力主要归因于Transformer多头自注意力机制(MHSA)中的输出投影模块(oproj)。为验证该假设,我们开发了网络诊断工具套件Stethoscope for Networks(SfN),用于探测和分析LLMs的内部行为。通过SfN,我们提供了间接与实证证据,表明oproj在实现推理功能中起核心作用,而其他模块更多贡献于流畅对话。这些发现为LLM可解释性提供了新视角,并为开发更具针对性的训练策略开辟了途径,有望实现更高效、更专业化的LLMs。


Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA

Abstract

arXiv:2505.20971v1 Announce Type: cross Abstract: LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit@1 scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.

摘要

大型语言模型(LLMs)在复杂推理任务中展现出卓越能力,但常存在幻觉问题且缺乏可靠的事实依据。与此同时,知识图谱(KGs)虽提供结构化事实知识,却缺乏LLMs的灵活推理能力。本文提出Reason-Align-Respond(RAR)框架,通过系统整合LLM推理与知识图谱来实现知识图谱问答(KGQA)。该框架包含三个核心组件:生成类人推理链的推理器(Reasoner)、将推理链映射至有效KG路径的对齐器(Aligner),以及合成最终答案的响应器(Responser)。我们将此过程建模为概率模型,并采用期望最大化算法进行优化,迭代精炼推理链与知识路径。在多基准测试上的大量实验表明,RAR在WebQSP和CWQ数据集上分别以93.3%和91.0%的Hit@1分数达到最先进性能。人工评估证实RAR生成的推理链质量高、可解释性强,且与KG路径高度吻合。此外,RAR展现出强大的零样本泛化能力,并在推理过程中保持计算高效性。


SageAttention2++: A More Efficient Implementation of SageAttention2

Abstract

arXiv:2505.21136v1 Announce Type: cross Abstract: The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.

摘要

注意力机制的效率至关重要,因为其时间复杂度随序列长度呈二次方增长。SageAttention2通过量化技术加速注意力中的矩阵乘法(Matmul)来解决这一问题。为进一步提升SageAttention2速度,我们提出采用FP8矩阵乘法(结果以FP16累加)的快速指令,该指令比SageAttention2原采用的FP8矩阵乘法快2倍。实验表明,SageAttention2++在保持与SageAttention2相同注意力精度的同时,相比FlashAttention实现了3.9倍加速。这意味着SageAttention2++能有效加速语言、图像和视频生成等各类模型,且端到端指标损失可忽略。代码将在https://github.com/thu-ml/SageAttention发布。


SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA

Abstract

arXiv:2505.21051v1 Announce Type: cross Abstract: Federated fine-tuning of large language models (LLMs) is critical for improving their performance in handling domain-specific tasks. However, prior work has shown that clients' private data can be recovered via gradient inversion attacks. Existing privacy preservation techniques against such attacks typically entail performance degradation and high costs, making them ill-suited for clients with heterogeneous data distributions and device capabilities. In this paper, we propose SHE-LoRA, which integrates selective homomorphic encryption (HE) and low-rank adaptation (LoRA) to enable efficient and privacy-preserving federated tuning of LLMs in cross-device environments. Heterogeneous clients adaptively select partial model parameters for homomorphic encryption based on parameter sensitivity assessment, with the encryption subset obtained via negotiation. To ensure accurate model aggregation, we design a column-aware secure aggregation method and customized reparameterization techniques to align the aggregation results with the heterogeneous device capabilities of clients. Extensive experiments demonstrate that SHE-LoRA maintains performance comparable to non-private baselines, achieves strong resistance to state-of-the-art attacks, and reduces communication overhead by 94.901% and encryption computation overhead by 99.829% compared to the baseline. Our code is accessible at https://anonymous.4open.science/r/SHE-LoRA-8D84.

摘要

大型语言模型(LLMs)的联邦微调对于提升其处理领域特定任务的性能至关重要。然而,已有研究表明,通过梯度反演攻击可实际恢复客户端的私有数据。针对此类攻击的现有隐私保护技术通常会导致性能下降和高昂成本,使其难以适配数据分布与设备能力异构的客户端。本文提出SHE-LoRA框架,通过整合选择性同态加密(HE)与低秩自适应(LoRA)技术,实现跨设备环境下高效且隐私保护的LLMs联邦调优。异构客户端基于参数敏感性评估自适应选择部分模型参数进行同态加密,并通过协商获取加密子集。为确保精准的模型聚合,我们设计了列感知的安全聚合方法和定制化重参数技术,使聚合结果与客户端的异构设备能力相匹配。大量实验表明,SHE-LoRA在保持与非隐私基线相当性能的同时,对最先进攻击具有强抵抗性,且相较于基线显著降低94.901%的通信开销和99.829%的加密计算开销。代码已开源:https://anonymous.4open.science/r/SHE-LoRA-8D84。


Creativity in LLM-based Multi-Agent Systems: A Survey

Abstract

arXiv:2505.21116v1 Announce Type: cross Abstract: Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks. This survey offers a structured framework and roadmap for advancing the development, evaluation, and standardization of creative MAS.

摘要

由大语言模型(LLM)驱动的多智能体系统(MAS)正在改变人类与AI协作生成创意和产物的方式。尽管现有综述对MAS基础设施进行了全面概述,但大多忽视了创造性这一维度,包括新颖输出如何生成与评估、创造性如何塑造智能体角色,以及创意工作流程如何协调。本文是首篇专注于MAS创造性的综述,聚焦文本与图像生成任务,提出:(1)智能体主动性与角色设计的分类体系;(2)生成技术概览,包括发散探索、迭代优化和协作合成,以及相关数据集与评估指标;(3)对关键挑战的讨论,如评估标准不一致、偏见缓解不足、协调冲突及缺乏统一基准。本综述为推进创造性MAS的开发、评估与标准化提供了结构化框架和路线图。


Efficient Large Language Model Inference with Neural Block Linearization

Abstract

arXiv:2505.21077v1 Announce Type: cross Abstract: The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in DeepSeek-R1-Distill-Llama-8B increases the inference speed by 32% with less than 1% accuracy trade-off, making it a flexible and promising solution to improve the inference efficiency of LLMs.

摘要

基于Transformer架构的大语言模型(LLM)在推理阶段的高计算需求给实际部署带来了重大挑战。为此,我们提出神经块线性化(NBL)框架,通过用线性最小均方误差估计器导出的线性近似替换自注意力层,实现Transformer模型的推理加速。该方法采用典型相关分析计算近似误差的理论上界,并将该界限作为层替换的判定标准,优先选择线性化误差最低的LLM层进行替换。NBL无需微调即可高效应用于预训练大语言模型。实验表明,该框架在保持多项推理基准测试竞争力的同时,显著提升了计算速度。例如,在DeepSeek-R1-Distill-Llama-8B模型中替换12个自注意力层后,推理速度提升32%且精度损失不足1%,证明NBL是提升大语言模型推理效率的灵活且具有前景的解决方案。
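To make the linearization step concrete, the sketch below fits a Linear Minimum Mean Squared Error estimator in the scalar case, where the weight is Cov(x, y) / Var(x) and the bias is mean(y) - w * mean(x); in the layer-level setting these scalars become covariance matrices. Two things here are assumptions for illustration: the function names are hypothetical, and the layer ranking uses plain per-layer error values as a stand-in for the CCA-based bound that the abstract describes.

```python
def fit_lmmse(xs, ys):
    """Fit the scalar LMMSE estimator y ~ w * x + b from paired samples.

    w = Cov(x, y) / Var(x),  b = mean(y) - w * mean(x)
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the linear approximation on the samples."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def layers_to_linearize(errors, budget):
    """Return the indices of the `budget` layers with the lowest error.

    `errors` holds one value per layer; using a raw error value here is a
    stand-in for the theoretical bound used as the selection criterion.
    """
    ranked = sorted(range(len(errors)), key=lambda i: errors[i])
    return sorted(ranked[:budget])
```

The estimator is computed in closed form from calibration statistics, which is consistent with the abstract's claim that no fine-tuning is required.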


BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge

Abstract

arXiv:2505.21092v1 Announce Type: cross Abstract: In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college and job level examinations and spans 23 categories covering knowledge on Bangladesh's culture and history and Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs' performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali's status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark that is centered around native Bengali culture, history, and linguistics.

摘要

在本研究中,我们推出了BLUCK数据集,该数据集旨在评估大语言模型(LLMs)在孟加拉语语言理解和文化知识方面的表现。我们的数据集包含2366道精心编制的选择题(MCQs),题目来源涵盖多所大学及职业级别考试的题库,涉及23个类别,包括孟加拉国文化与历史知识以及孟加拉语言学。我们使用6个专有模型和3个开源LLM(包括GPT-4o、Claude-3.5-Sonnet、Gemini-1.5-Pro、Llama-3.3-70B-Instruct和DeepSeekV3)对BLUCK进行了基准测试。结果表明,尽管这些模型整体表现尚可,但在孟加拉语音学某些领域仍存在困难。虽然当前LLMs在孟加拉文化及语言语境中的表现仍无法与英语等主流语言相媲美,但我们的研究证实孟加拉语属于中等资源语言。值得注意的是,BLUCK也是首个以本土孟加拉文化、历史及语言学为核心的选择题评估基准。


Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling

Abstract

arXiv:2505.21074v1 Announce Type: cross Abstract: Text-to-image (T2I) models raise ethical and safety concerns due to their potential to generate inappropriate or harmful images. Evaluating these models' security through red-teaming is vital, yet white-box approaches are limited by their need for internal access, complicating their use with closed-source models. Moreover, existing black-box methods often assume knowledge about the model's specific defense mechanisms, limiting their utility in real-world commercial API scenarios. A significant challenge is how to evade unknown and diverse defense mechanisms. To overcome this difficulty, we propose a novel Rule-based Preference modeling Guided Red-Teaming (RPG-RT), which iteratively employs LLM to modify prompts to query and leverages feedback from T2I systems for fine-tuning the LLM. RPG-RT treats the feedback from each iteration as a prior, enabling the LLM to dynamically adapt to unknown defense mechanisms. Given that the feedback is often labeled and coarse-grained, making it difficult to utilize directly, we further propose rule-based preference modeling, which employs a set of rules to evaluate desired or undesired feedback, facilitating finer-grained control over the LLM's dynamic adaptation process. Extensive experiments on nineteen T2I systems with varied safety mechanisms, three online commercial API services, and T2V models verify the superiority and practicality of our approach.

摘要

文本到图像(T2I)模型因其可能生成不当或有害图像而引发伦理和安全担忧。通过红队测试评估这些模型的安全性至关重要,但白盒方法因需要内部访问而受限,难以应用于闭源模型。此外,现有的黑盒方法通常假设已知模型的特定防御机制,限制了其在现实商业API场景中的实用性。一个关键挑战是如何规避未知且多样化的防御机制。为解决这一难题,我们提出了一种新颖的基于规则偏好建模的引导红队测试(RPG-RT),该方法迭代地利用大型语言模型(LLM)修改提示词进行查询,并借助T2I系统的反馈对LLM进行微调。RPG-RT将每次迭代的反馈视为先验知识,使LLM能够动态适应未知防御机制。鉴于反馈通常带有标签且粒度较粗,难以直接利用,我们进一步提出基于规则的偏好建模,通过一组规则评估期望或不期望的反馈,从而实现对LLM动态适应过程的更精细控制。在十九种具有不同安全机制的T2I系统、三种在线商业API服务及T2V模型上的大量实验验证了该方法的优越性和实用性。


A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction

Abstract

arXiv:2505.21109v1 Announce Type: cross Abstract: Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues. Most existing adaptation methods do not prioritize reducing the computational resources required for fine-tuning and inference of language models. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical. This work introduces a novel approach called the Small Language Graph (SLG), which is a lightweight adaptation solution designed to address the two key challenges outlined above. The system is structured in the form of a graph, where each node represents a lightweight expert - a small language model fine-tuned on specific and concise texts. The results of this study have shown that SLG was able to surpass conventional fine-tuning methods on the Exact Match metric by 3 times. Additionally, the fine-tuning process was 1.7 times faster compared to that of a larger stand-alone language model. These findings introduce a potential for small to medium-sized engineering companies to confidently use generative AI technologies, such as LLMs, without the necessity to invest in expensive computational resources. Also, the graph architecture and the small size of expert nodes offer a possible opportunity for distributed AI systems, thus potentially diverting the global need for expensive centralized compute clusters.

摘要

尽管大规模语言模型的领域适应技术近期取得了进展,但这些方法仍然计算密集,且所得模型仍可能产生幻觉问题。现有的大多数适应方法并未优先考虑降低语言模型微调和推理所需的计算资源。虽然随着新模型的发布,幻觉问题已逐步减少,但在工程应用场景中仍普遍存在——这些场景要求生成结构良好且错误与矛盾最少的文本。本研究提出了一种名为"小型语言图"(SLG)的创新方法,该轻量级适应方案旨在解决上述两个关键挑战。该系统采用图结构构建,其中每个节点代表一个轻量级专家——即在特定简洁文本上微调的小型语言模型。研究结果表明,SLG在精确匹配指标上能够超越传统微调方法3倍。同时,其微调过程相比独立大型语言模型提速1.7倍。这些发现为中小型工程公司创造了可能性,使其无需投资昂贵计算资源即可放心使用生成式AI技术(如大型语言模型)。此外,图架构与小型专家节点为分布式AI系统提供了潜在机遇,可能由此改变全球对昂贵集中式计算集群的需求格局。


Thinker: Learning to Think Fast and Slow

Abstract

arXiv:2505.21097v1 Announce Type: cross Abstract: Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training.

摘要

近期研究表明,通过对数学和编程等领域的问答任务应用强化学习(RL),可以提升大语言模型(LLMs)的推理能力。在长上下文条件下,LLMs可能学会执行搜索行为(如DeepSeek R1中观察到的自我纠正现象所示),但此类搜索往往不够精确且缺乏置信度,导致生成冗长冗余的响应,并暴露出直觉与验证能力的不足。受心理学双过程理论启发,我们提出一种改进问答任务的简易方法,包含四个阶段:快速思维(要求模型在严格token限制内作答)、验证(模型评估初始回答)、慢速思维(经深入思考修正初始回答)以及总结(提炼前阶段修正内容为精确步骤)。该方法使Qwen2.5-1.5B的平均准确率从24.9%提升至27.9%,DeepSeek-R1-Qwen-1.5B从45.9%提升至49.8%。值得注意的是,Qwen2.5-1.5B仅通过快速思维模式(使用少于1000个token)即可达到26.8%的准确率,显示出显著的推理效率提升。这些发现表明,直觉与审慎推理是两种独立且互补的系统,可通过针对性训练获得协同提升。
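The four stages can be read as a fixed sequence of calls with different token budgets. The sketch below only illustrates that control flow at inference time; it does not cover the reinforcement learning setup. The `llm` callable, the prompt wording, and both budget values are hypothetical choices made for the example.

```python
def thinker_episode(question, llm, fast_budget=1000, slow_budget=6000):
    """Run the four stages in order and return every intermediate output.

    llm: callable(prompt: str, max_tokens: int) -> str  (hypothetical interface)
    """
    # Stage 1, Fast Thinking: answer within a strict token budget.
    fast = llm(f"Answer concisely.\nQuestion: {question}", fast_budget)
    # Stage 2, Verification: the model evaluates its own initial answer.
    verdict = llm(
        f"Question: {question}\nAnswer: {fast}\nIs this answer correct? Explain.",
        slow_budget,
    )
    # Stage 3, Slow Thinking: refine the initial answer with more deliberation.
    slow = llm(
        f"Question: {question}\nInitial answer: {fast}\n"
        f"Verification: {verdict}\nThink carefully and give a refined answer.",
        slow_budget,
    )
    # Stage 4, Summarization: distill the refinement into precise steps.
    summary = llm(
        f"Summarize the reasoning below into precise steps.\n{slow}", fast_budget
    )
    return {"fast": fast, "verification": verdict, "slow": slow, "summary": summary}
```

Returning all four outputs keeps the fast answer available on its own, which matters because the abstract reports that the Fast Thinking mode alone is already competitive.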


M-Wanda: Improving One-Shot Pruning for Multilingual LLMs

Abstract

arXiv:2505.21171v1 Announce Type: cross Abstract: Multilingual LLM performance is often critically dependent on model size. With an eye on efficiency, this has led to a surge in interest in one-shot pruning methods that retain the benefits of large-scale pretraining while shrinking the model size. However, as pruning tends to come with performance loss, it is important to understand the trade-offs between multilinguality and sparsification. In this work, we study multilingual performance under different sparsity constraints and show that moderate ratios already substantially harm performance. To help bridge this gap, we propose M-Wanda, a pruning method that models cross-lingual variation by incorporating language-aware activation statistics into its pruning criterion and dynamically adjusts layerwise sparsity based on cross-lingual importance. We show that M-Wanda consistently improves performance at minimal additional costs. We are the first to explicitly optimize pruning to retain multilingual performance, and hope to inspire future advances in multilingual pruning.

摘要

多语言大语言模型的性能往往高度依赖于模型规模。出于效率考量,这一现象引发了人们对单次剪枝方法的浓厚兴趣,该方法能在缩小模型规模的同时保留大规模预训练的优势。然而,由于剪枝通常伴随性能损失,理解多语言性与稀疏化之间的权衡至关重要。本研究探讨了不同稀疏约束下的多语言表现,证明中等稀疏比例已会显著损害性能。为弥合这一差距,我们提出M-Wanda剪枝方法,该方法通过将语言感知的激活统计纳入剪枝准则来建模跨语言差异,并根据跨语言重要性动态调整分层稀疏度。实验表明M-Wanda能以极低额外成本持续提升性能。我们首次明确优化剪枝以保持多语言性能,期望能启发多语言剪枝领域的未来进展。
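For orientation, the sketch below shows a Wanda-style pruning criterion, in which each weight is scored by its magnitude times the activation norm of its input feature and the lowest-scoring weights in each output row are zeroed. The language-aware part is an assumption made for illustration: per-language activation norms are simply averaged, which may differ from how M-Wanda combines them, and the dynamic layerwise sparsity adjustment is not shown.

```python
def combined_norms(norms_by_language):
    """Average per-input-feature activation norms across languages.

    norms_by_language: dict mapping a language code to a list of norms,
    one per input feature. Plain averaging is an assumption of this sketch.
    """
    per_language = list(norms_by_language.values())
    return [sum(column) / len(per_language) for column in zip(*per_language)]


def wanda_prune(weights, act_norms, sparsity):
    """Zero the lowest-scoring weights in each output row.

    score[i][j] = |weights[i][j]| * act_norms[j]
    """
    pruned = []
    for row in weights:
        scores = [abs(w) * n for w, n in zip(row, act_norms)]
        n_drop = int(len(row) * sparsity)
        drop = set(sorted(range(len(row)), key=lambda j: scores[j])[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned
```

In the usage below, a weight of 0.2 survives at 50% sparsity because its input feature has a large activation norm, whereas pure magnitude pruning would have removed it in favour of the 0.3 weight:

```python
norms = combined_norms({"en": [1.0, 1.0, 4.0, 1.0], "bn": [1.0, 3.0, 4.0, 1.0]})
print(wanda_prune([[0.5, -0.1, 0.2, 0.3]], norms, 0.5))
```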


PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing

Abstract

arXiv:2505.21184v1 Announce Type: cross Abstract: To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies a model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as base templates in a counterfactual manner. Subsequently, we decompose each base template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.

摘要

为构建负责任且安全的人工智能应用,有害信息数据被广泛用于对抗性测试及防护措施开发。现有研究主要利用大语言模型(LLMs)合成数据以大规模获取高质量任务数据集,从而避免高昂的人工标注成本。然而受限于LLMs的安全对齐机制,有害数据的合成在生成可靠性与内容多样性方面仍面临挑战。本研究提出一种新型有害信息合成框架PoisonSwarm,通过采用模型众包策略,在保持高成功率的同时生成多样化有害数据。具体而言,我们以反事实方式生成大量良性数据作为基础模板,随后将每个基础模板分解为多个语义单元,通过动态模型切换进行逐单元毒化及最终优化,从而确保合成成功率。实验结果表明,PoisonSwarm在合成不同类别有害数据时具有高度可扩展性和多样性,其性能达到当前最优水平。


Exploring the Latent Capacity of LLMs for One-Step Text Generation

Abstract

arXiv:2505.21189v1 Announce Type: cross Abstract: A recent study showed that large language models (LLMs) can reconstruct surprisingly long texts - up to thousands of tokens - via autoregressive generation from just one specially trained input embedding. In this work, we explore whether such reconstruction is possible without autoregression. We show that frozen LLMs can generate hundreds of accurate tokens in just one forward pass, when provided with only two learned embeddings. This reveals a surprising and underexplored capability of LLMs - multi-token generation without iterative decoding. We investigate the behaviour of these embeddings and provide insight into the type of information they encode. We also empirically show that although these representations are not unique for a given text, they form connected and local regions in embedding space - a property that suggests the potential of learning a dedicated encoder into that space.

摘要

最近一项研究表明,大型语言模型(LLMs)仅需通过一个经过特殊训练的输入嵌入,就能通过自回归生成方式重构出惊人长度的文本(可达数千个标记)。本研究中,我们探讨了这种重构是否可以在非自回归条件下实现。实验证明,当仅提供两个学习得到的嵌入时,冻结参数的LLMs仅需单次前向传播即可生成数百个准确标记。这一发现揭示了LLMs尚未被充分探索的惊人能力——无需迭代解码即可实现多标记生成。我们研究了这些嵌入向量的行为特征,并深入解析了其编码的信息类型。实证研究表明,尽管这些表征对于给定文本并非唯一,但它们在嵌入空间中形成了连通且局部的区域——这一特性暗示了学习专用编码器进入该空间的潜在可能性。


Pretrained LLMs Learn Multiple Types of Uncertainty

Abstract

arXiv:2505.21218v1 Announce Type: cross Abstract: Large Language Models are known to capture real-world knowledge, allowing them to excel in many downstream tasks. Despite recent advances, these models are still prone to what are commonly known as hallucinations, causing them to emit unwanted and factually incorrect text. In this work, we study how well LLMs capture uncertainty without being explicitly trained to do so. We show that, if uncertainty is considered a linear concept in the model's latent space, it might indeed be captured, even after only pretraining. We further show that, though unintuitive, LLMs appear to capture several different types of uncertainty, each of which can be useful for predicting correctness on a specific task or benchmark. Furthermore, we provide in-depth results, such as demonstrating a correlation between our correctness prediction and the model's ability to abstain from misinformation using words, and the lack of impact of model scaling on capturing uncertainty. Finally, we claim that unifying the uncertainty types into a single one using instruction-tuning or [IDK]-token tuning is helpful for the model in terms of correctness prediction.

摘要

众所周知,大语言模型能够捕捉现实世界知识,使其在下游任务中表现优异。尽管近期取得进展,这些模型仍易产生所谓"幻觉",导致生成不必要且事实错误的文本。本研究探讨语言模型在未经明确训练的情况下对不确定性的捕捉能力。研究表明,若将不确定性视为模型潜在空间中的线性概念,即使在预训练后也可能被捕获。进一步发现,尽管有违直觉,语言模型似乎能捕捉多种不同类型的不确定性,每种类型皆可用于预测特定任务或基准的正确性。此外,我们提供了深度分析结果,包括:证明校正预测与模型通过词语避免错误信息的能力存在相关性,以及模型规模对不确定性捕捉缺乏影响。最后,我们提出通过指令微调或[IDK]标记微调将各类不确定性统一为单一类型,有助于提升模型的正确性预测能力。
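
The idea of treating uncertainty as a linear concept in latent space can be illustrated with a linear probe. The following is a minimal sketch, not the authors' method: the hidden states are synthetic Gaussian vectors, the "correct/incorrect" labels are generated along an assumed direction, and the function names (`train_linear_probe`, `probe_predict`, `make_synthetic_states`) are hypothetical.

```python
import math
import random


def train_linear_probe(states, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(states[0])
    w = [0.0] * dim
    b = 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, x):
    """Return 1 (answer predicted correct) or 0 (predicted incorrect)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_synthetic_states(n, dim, rng):
    """Stand-in for real hidden states: the label shifts the vector along
    one fixed direction, which is exactly the 'linear concept' assumption."""
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y == 1 else -1.0
        states.append([rng.gauss(0.0, 1.0) + shift * d for d in direction])
        labels.append(y)
    return states, labels


rng = random.Random(0)
train_x, train_y = make_synthetic_states(200, 8, rng)
test_x, test_y = make_synthetic_states(200, 8, rng)
w, b = train_linear_probe(train_x, train_y)
accuracy = sum(probe_predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With real models, `states` would be activations taken from a chosen layer and `labels` would record whether the model's answer was correct; both are outside the scope of this sketch.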


Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)

Abstract

arXiv:2505.21091v1 Announce Type: cross Abstract: System prompts in Large Language Models (LLMs) are predefined directives that guide model behaviour, taking precedence over user inputs in text processing and generation. LLM deployers increasingly use them to ensure consistent responses across contexts. While model providers set a foundation of system prompts, deployers and third-party developers can append additional prompts without visibility into others' additions, and this layered implementation remains entirely hidden from end-users. As system prompts become more complex, they can directly or indirectly introduce unaccounted-for side effects. This lack of transparency raises fundamental questions about how the position of information in different directives shapes model outputs. As such, this work examines how the placement of information affects model behaviour. To this end, we compare how models process demographic information in system versus user prompts across six commercially available LLMs and 50 demographic groups. Our analysis reveals significant biases, manifesting in differences in user representation and decision-making scenarios. Since these variations stem from inaccessible and opaque system-level configurations, they risk representational, allocative and potentially other biases and downstream harms beyond the user's ability to detect or correct. Our findings draw attention to these critical issues, which have the potential to perpetuate harms if left unexamined. Further, we argue that system prompt analysis must be incorporated into AI auditing processes, particularly as customisable system prompts become increasingly prevalent in commercial AI deployments.

摘要

大语言模型(LLMs)中的系统提示是预定义的指令,用于引导模型行为,在文本处理与生成过程中优先于用户输入。模型部署者日益依赖系统提示来确保跨语境响应的一致性。尽管模型提供者设定了系统提示的基础框架,但部署者与第三方开发者可在不透明的情况下追加额外提示,而终端用户完全无法察觉这种分层实现机制。随着系统提示日趋复杂,它们可能直接或间接引发未预期的副作用。这种透明度的缺失引发了根本性问题:信息在不同指令中的位置如何影响模型输出?为此,本研究探究了信息位置对模型行为的影响。我们通过六种商用LLMs和50个人口统计组别,比较了模型处理系统提示与用户提示中人口统计信息的差异。分析表明存在显著偏见,体现在用户表征与决策场景的差异中。由于这些差异源于不可访问且不透明的系统级配置,可能导致表征性、分配性及其他潜在偏见与下游危害,且超出用户的检测与修正能力。本研究揭示了这些关键问题,若不加以审视可能持续造成危害。我们进一步主张,必须将系统提示分析纳入AI审计流程,尤其是在可定制系统提示于商业AI部署中日益普及的背景下。


How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian

Abstract

arXiv:2505.21301v1 Announce Type: cross Abstract: People can categorize the same entity at multiple taxonomic levels, such as basic (bear), superordinate (animal), and subordinate (grizzly bear). While prior research has focused on basic-level categories, this study is the first attempt to examine the organization of categories by analyzing exemplars produced at the subordinate level. We present a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete words. We then use these data to evaluate whether textual and vision LLMs produce meaningful exemplars that align with human category organization across three key tasks: exemplar generation, category induction, and typicality judgment. Our findings show a low alignment between humans and LLMs, consistent with previous studies. However, their performance varies notably across different semantic domains. Ultimately, this study highlights both the promises and the constraints of using AI-generated exemplars to support psychological and linguistic research.

摘要

人们能够在多个分类层级上对同一实体进行范畴化,例如基本层级(熊)、上位层级(动物)和下位层级(灰熊)。尽管先前研究主要关注基本层级范畴,但本研究首次尝试通过分析下位层级产生的范例来探究范畴的组织结构。我们提出了一个新的意大利语心理语言学数据集,包含187个具体词汇的人类生成范例。随后,我们利用这些数据评估文本和视觉大语言模型(LLMs)在三个关键任务(范例生成、范畴归纳和典型性判断)中是否能产生与人类范畴组织一致的有意义范例。研究结果显示,人类与LLMs之间的对齐程度较低,这与既往研究一致。然而,它们在不同语义领域的表现存在显著差异。最终,本研究揭示了使用AI生成范例支持心理学和语言学研究的潜力与局限性。


Improving LLM-based Global Optimization with Search Space Partitioning

Abstract

arXiv:2505.21372v1 Announce Type: cross Abstract: Large Language Models (LLMs) have recently emerged as effective surrogate models and candidate generators within global optimization frameworks for expensive blackbox functions. Despite promising results, LLM-based methods often struggle in high-dimensional search spaces or when lacking domain-specific priors, leading to sparse or uninformative suggestions. To overcome these limitations, we propose HOLLM, a novel global optimization algorithm that enhances LLM-driven sampling by partitioning the search space into promising subregions. Each subregion acts as a "meta-arm" selected via a bandit-inspired scoring mechanism that effectively balances exploration and exploitation. Within each selected subregion, an LLM then proposes high-quality candidate points, without any explicit domain knowledge. Empirical evaluation on standard optimization benchmarks shows that HOLLM consistently matches or surpasses leading Bayesian optimization and trust-region methods, while substantially outperforming global LLM-based sampling strategies.

摘要

大型语言模型(LLMs)近期在昂贵黑箱函数的全局优化框架中展现出作为高效代理模型和候选生成器的潜力。尽管成果显著,基于LLM的方法在高维搜索空间或缺乏领域先验知识时往往表现不佳,导致生成建议稀疏或信息量不足。为克服这些局限,我们提出HOLLM——一种新颖的全局优化算法,通过将搜索空间划分为有前景的子区域来增强LLM驱动的采样。每个子区域作为"元臂"通过受多臂老虎机启发的评分机制进行选择,该机制有效平衡探索与利用。在选定的子区域内,LLM无需显式领域知识即可生成高质量候选点。标准优化基准测试表明,HOLLM在性能上持续匹配或超越主流贝叶斯优化与信赖域方法,同时显著优于全局性的基于LLM的采样策略。
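
The combination of search-space partitioning and bandit-style region selection can be sketched as follows. This is an illustration under stated assumptions, not the HOLLM algorithm itself: the partition is a fixed equal-width split of [0, 1], the score is a standard UCB1-style rule, the objective is a toy quadratic, and `propose_candidate` is a hypothetical stand-in that samples uniformly where the real method would query an LLM.

```python
import math
import random


def objective(x):
    """Toy black-box function to maximize; its optimum is at x = 0.7."""
    return -(x - 0.7) ** 2


def ucb_score(mean_reward, pulls, total_pulls, c=0.2):
    """UCB1-style score: exploit the mean, explore rarely visited regions."""
    if pulls == 0:
        return float("inf")  # force one visit to every region first
    return mean_reward + c * math.sqrt(math.log(total_pulls) / pulls)


def propose_candidate(region, rng):
    """Hypothetical stand-in for the LLM proposer: uniform sample in region."""
    lo, hi = region
    return rng.uniform(lo, hi)


def optimize(num_regions=4, budget=60, seed=0):
    rng = random.Random(seed)
    width = 1.0 / num_regions
    regions = [(i * width, (i + 1) * width) for i in range(num_regions)]
    pulls = [0] * num_regions
    means = [0.0] * num_regions
    best_x, best_y = None, float("-inf")
    for t in range(1, budget + 1):
        scores = [ucb_score(means[i], pulls[i], t) for i in range(num_regions)]
        arm = max(range(num_regions), key=scores.__getitem__)
        x = propose_candidate(regions[arm], rng)
        y = objective(x)
        pulls[arm] += 1
        means[arm] += (y - means[arm]) / pulls[arm]  # running mean update
        if y > best_y:
            best_x, best_y = x, y
    return best_x, best_y, pulls


best_x, best_y, pulls = optimize()
print("best x:", round(best_x, 3), "pulls per region:", pulls)
```

The exploration constant `c` controls how often low-scoring regions are revisited; the value used here is arbitrary.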


Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

Abstract

arXiv:2505.21277v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.

摘要

尽管大型语言模型(LLMs)具备先进的通用能力,但仍存在诸多安全风险,尤其是可绕过安全协议的越狱攻击。通过更贴近真实场景的黑盒越狱攻击来理解这些漏洞,能为模型鲁棒性提供关键洞见。现有方法虽通过多种提示工程技术取得改进,但其对安全对齐模型的成功率仍然有限,这忽视了一个更根本的问题:攻击效果本质上受限于预定义的策略空间边界。然而扩展该策略空间面临双重挑战:既要系统化捕捉核心攻击模式,又需高效应对由此增加的复杂度。为深入探索策略空间扩展的潜力,我们通过新颖框架应对这些挑战:基于精细加工可能性模型(ELM)理论将越狱策略分解为核心组件,并开发出结合意图评估机制的遗传优化方法。实验结果表明,策略空间扩展带来了突破性的越狱能力:在Claude-3.5模型上取得超过90%的成功率(现有方法完全失效),同时展现出强大的跨模型可迁移性,并在评估准确率上超越专用防护模型。代码已开源:https://github.com/Aries-iai/CL-GSO。


Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

Abstract

arXiv:2505.21364v1 Announce Type: cross Abstract: Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping, significantly increasing the model's next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights, preserving the original decoders' expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language, opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is available at: https://github.com/james-oldfield/MxD/.

摘要

多层感知机(MLP)是大型语言模型的核心组件,但其稠密表示特性导致模型难以理解、编辑与调控。现有方法通过神经元级稀疏性学习可解释的近似表示,但无法忠实重构原始映射——会显著增加模型的下一词元交叉熵损失。本文提出转向层级稀疏性以克服稀疏层近似中的精度权衡问题。基于此范式,我们提出解码器混合架构(MxD)。MxD泛化了MLP和门控线性单元,将预训练稠密层扩展为数万个专用子层。通过灵活的张量分解形式,每个稀疏激活的MxD子层均实现具有全秩权重的线性变换——即使在高度稀疏条件下仍能保持原始解码器的表达能力。实验表明,在参数规模达30亿的语言模型中,MxD在稀疏性-精度边界上显著优于Transcoders等前沿方法。稀疏探测与特征调控的进一步评估证实,MxD能学习到自然语言中类似的专用特征——为设计兼具可解释性与保真度的分解架构开辟了新途径。代码已开源:https://github.com/james-oldfield/MxD/。


Factual Self-Awareness in Language Models: Representation, Robustness, and Scaling

Abstract

arXiv:2505.21399v1 Announce Type: cross Abstract: Factual incorrectness in generated content is one of the primary concerns in ubiquitous deployment of large language models (LLMs). Prior findings suggest LLMs can (sometimes) detect factual incorrectness in their generated content (i.e., fact-checking post-generation). In this work, we provide evidence supporting the presence of an internal compass in LLMs that dictates the correctness of factual recall at the time of generation. We demonstrate that for a given subject entity and a relation, LLMs internally encode linear features in the Transformer's residual stream that dictate whether the model will be able to recall the correct attribute (that forms a valid entity-relation-attribute triplet). This self-awareness signal is robust to minor formatting variations. We investigate the effects of context perturbation via different example selection strategies. Scaling experiments across model sizes and training dynamics highlight that self-awareness emerges rapidly during training and peaks in intermediate layers. These findings uncover intrinsic self-monitoring capabilities within LLMs, contributing to their interpretability and reliability.

摘要

生成内容中的事实性错误是大规模语言模型(LLMs)广泛应用中的主要担忧之一。先前研究表明,LLMs(有时)能够检测其生成内容中的事实性错误(即生成后的事实核查)。本研究中,我们提供证据支持LLMs内部存在一种决定事实回忆正确性的内在导向机制。我们证明,对于给定的主体实体和关系,LLMs在Transformer残差流中编码了线性特征,这些特征决定了模型是否能够回忆出正确的属性(从而形成有效的实体-关系-属性三元组)。这种自我觉察信号对微小格式变化具有鲁棒性。我们通过不同示例选择策略研究了上下文扰动的影响。跨模型规模和训练动态的扩展实验表明,自我觉察能力在训练过程中快速形成,并在中间层达到峰值。这些发现揭示了LLMs内在的自我监控能力,有助于提升其可解释性与可靠性。


RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

Abstract

arXiv:2505.21409v1 Announce Type: cross Abstract: Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs' ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.

摘要

大型语言模型(LLMs)的事实性始终是一个持续挑战。现有基准测试通常评估简短的事实性回答,忽视了从参数化知识生成结构化、多记录表格输出的关键能力。我们证明这种关系型事实检索比孤立的点状查询更为困难——即使模型已知单个事实,仍会暴露出对输出维度(如属性或记录数量)敏感的独特故障模式。为系统评估这一尚未充分探索的能力,我们提出RelationalFactQA基准测试,其包含多样化的自然语言问题(与SQL配对)和标准表格答案,专门用于评估结构化格式的知识检索。该基准支持跨查询复杂度、输出规模和数据特征的分析。实验表明,即使最先进的LLMs在生成关系型输出时也表现欠佳,事实准确率不超过25%,且性能随输出维度增加显著下降。这些发现揭示了当前LLMs在合成结构化事实知识方面的关键局限性,同时确立了RelationalFactQA作为衡量LLM事实性未来进展的重要基准资源。


Improving Research Idea Generation Through Data: An Empirical Investigation in Social Science

Abstract

arXiv:2505.21396v1 Announce Type: cross Abstract: Recent advancements in large language models (LLMs) have shown promise in generating novel research ideas. However, these ideas often face challenges related to feasibility and expected effectiveness. This paper explores how augmenting LLMs with relevant data during the idea generation process can enhance the quality of generated ideas. We introduce two ways of incorporating data: (1) providing metadata during the idea generation stage to guide LLMs toward feasible directions, and (2) adding automatic validation during the idea selection stage to assess the empirical plausibility of hypotheses within ideas. We conduct experiments in the social science domain, specifically with climate negotiation topics, and find that metadata improves the feasibility of generated ideas by 20%, while automatic validation improves the overall quality of selected ideas by 7%. A human study shows that LLM-generated ideas, along with their related data and validation processes, inspire researchers to propose research ideas of higher quality. Our work highlights the potential of data-driven research idea generation and underscores the practical utility of LLM-assisted ideation in real-world academic settings.

摘要

大型语言模型(LLM)的最新进展在生成新颖研究思路方面展现出潜力,但这些思路往往面临可行性与预期效果方面的挑战。本文探讨了在构思阶段通过增强LLM相关数据来提升生成思路质量的方法。我们提出两种数据整合方式:(1)在构思阶段提供元数据以引导LLM朝向可行方向;(2)在思路筛选阶段加入自动验证机制以评估假设的实证合理性。我们在社会科学领域(具体针对气候谈判主题)开展实验,发现元数据可使生成思路的可行性提升20%,而自动验证能使选定思路的整体质量提高7%。一项人工研究表明,LLM生成思路及其相关数据与验证流程能启发研究者提出更高质量的研究构想。本工作揭示了数据驱动的研究思路生成潜力,并强调了LLM辅助构思在真实学术场景中的实用价值。


RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation

Abstract

arXiv:2505.21413v1 Announce Type: cross Abstract: Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models' internal knowledge and would fail in domains beyond the LLMs' knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.

摘要

工具能够增强大语言模型(LLM)在复杂问题解决任务中的推理能力,但并非所有任务都具备现成工具。在缺乏预定义工具的情况下,先前研究尝试指导LLM自主生成工具。然而,这类方法过度依赖模型内部知识,当任务超出LLM知识范围时便会失效。为突破这一局限,我们提出RefTool——一种基于参考资料引导的自动工具创建框架,该框架利用教科书等结构化外部材料。RefTool包含两个核心模块:(1)工具创建:LLM根据参考内容生成可执行工具,通过示例验证其有效性,并将工具按层级结构组织成工具箱;(2)工具应用:LLM通过导航工具箱结构选择并调用合适工具解决问题。在因果推理、物理和化学领域的实验表明,RefTool以平均11.3%的准确率优势超越现有工具创建方法和领域专用推理方法,同时具备高成本效益和广泛泛化能力。分析表明,基于参考资料创建工具能生成精确可靠的工具,而层级结构则有效促进了工具选择。RefTool使LLM能够突破知识边界,证实了基于外部参考进行工具创建对增强推理能力和泛化性能的重要价值。


Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Abstract

arXiv:2505.21457v1 Announce Type: cross Abstract: Active vision, also known as active perception, refers to the process of actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. Recently, the use of Multimodal Large Language Models (MLLMs) as central planning and decision-making modules in robotic systems has gained extensive attention. However, despite the importance of active perception in embodied intelligence, there is little to no exploration of how MLLMs can be equipped with or learn active perception capabilities. In this paper, we first provide a systematic definition of MLLM-based active perception tasks. We point out that the recently proposed GPT-o3 model's zoom-in search strategy can be regarded as a special case of active perception; however, it still suffers from low search efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-O3, a purely reinforcement learning based training framework built on top of GRPO, designed to equip MLLMs with active perception capabilities. We further establish a comprehensive benchmark suite to evaluate ACTIVE-O3 across both general open-world tasks, such as small-object and dense object grounding, and domain-specific scenarios, including small object detection in remote sensing and autonomous driving, as well as fine-grained interactive segmentation. In addition, ACTIVE-O3 also demonstrates strong zero-shot reasoning abilities on the V* Benchmark, without relying on any explicit reasoning data. We hope that our work can provide a simple codebase and evaluation protocol to facilitate future research on active perception in MLLMs.

摘要

主动视觉(又称主动感知)是指通过主动选择观察位置和方式来获取任务相关信息的过程。作为人类和高级具身智能体高效感知与决策的关键组成部分,该能力在多模态大语言模型(MLLMs)应用于机器人系统核心规划决策模块的研究热潮中却鲜有探讨。本文首次系统定义了基于MLLMs的主动感知任务,指出近期提出的GPT-o3模型采用的放大搜索策略可视为主动感知的特例,但其仍存在搜索效率低下和区域选择不准等问题。为此,我们提出ACTIVE-O3,一个基于GRPO框架的纯强化学习训练系统,旨在赋予MLLMs主动感知能力。通过构建涵盖开放世界任务(小物体定位、密集目标定位)与垂直领域场景(遥感小目标检测、自动驾驶、细粒度交互式分割)的基准测试体系,验证了该方法的有效性。值得注意的是,ACTIVE-O3在V* Benchmark上展现出强大的零样本推理能力,且无需依赖任何显式推理数据。本研究期望通过提供简洁的代码库和评估协议,推动MLLMs主动感知领域的后续探索。


Policy Optimized Text-to-Image Pipeline Design

Abstract

arXiv:2505.21478v1 Announce Type: cross Abstract: Text-to-image generation has evolved beyond single monolithic models to complex multi-component pipelines. These combine fine-tuned generators, adapters, upscaling blocks and even editing steps, leading to significant improvements in image quality. However, their effective design requires substantial expertise. Recent approaches have shown promise in automating this process through large language models (LLMs), but they suffer from two critical limitations: extensive computational requirements from generating images with hundreds of predefined pipelines, and poor generalization beyond memorized training examples. We introduce a novel reinforcement learning-based framework that addresses these inefficiencies. Our approach first trains an ensemble of reward models capable of predicting image quality scores directly from prompt-workflow combinations, eliminating the need for costly image generation during training. We then implement a two-phase training strategy: initial workflow vocabulary training followed by GRPO-based optimization that guides the model toward higher-performing regions of the workflow space. Additionally, we incorporate a classifier-free guidance based enhancement technique that extrapolates along the path between the initial and GRPO-tuned models, further improving output quality. We validate our approach through a set of comparisons, showing that it can successfully create new flows with greater diversity and lead to superior image quality compared to existing baselines.

摘要

文本到图像生成技术已从单一整体模型发展为复杂的多组件流程。这些系统整合了微调生成器、适配模块、超分辨率块甚至编辑步骤,显著提升了图像质量。然而,其有效设计需要大量专业知识。近期研究尝试通过大型语言模型(LLM)实现流程自动化,但存在两个关键缺陷:需要数百个预定义流程进行图像生成的高计算成本,以及对训练样本记忆之外的泛化能力不足。我们提出了一种基于强化学习的新框架来解决这些低效问题。该方法首先训练能够直接从提示-工作流组合预测图像质量分数的奖励模型集成,避免了训练期间昂贵的图像生成成本。随后采用两阶段训练策略:先进行工作流词汇表训练,再实施基于GRPO的优化算法引导模型探索工作流空间中更高性能的区域。此外,我们引入基于无分类器引导的增强技术,通过初始模型与GRPO优化模型之间的路径外推进一步提升输出质量。通过系列对比实验验证,本方法能成功创建更具多样性的新流程,相比现有基线模型可获得更优的图像生成质量。
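
The classifier-free-guidance-style enhancement, which extrapolates along the path between the initial and GRPO-tuned models, can be sketched on next-token logits. This is a sketch under an assumption the abstract does not confirm: that the extrapolation is applied to per-token logits. The function names and the `guidance_scale` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extrapolate_logits(base_logits, tuned_logits, guidance_scale):
    """Move from the base model toward (and, for scale > 1, past) the tuned model.

    scale = 0 reproduces the base model, scale = 1 the tuned model,
    and scale > 1 extrapolates beyond the tuned model along the same direction.
    """
    return [b + guidance_scale * (t - b) for b, t in zip(base_logits, tuned_logits)]


# Illustrative logits over three candidate workflow tokens (made-up numbers).
base = [1.0, 1.0, 1.0]
tuned = [2.0, 1.0, 0.5]
guided = extrapolate_logits(base, tuned, guidance_scale=1.5)

print("tuned probabilities: ", [round(p, 3) for p in softmax(tuned)])
print("guided probabilities:", [round(p, 3) for p in softmax(guided)])
```

In this toy case, the token that tuning already favored receives even more probability mass once the scale exceeds 1, which is the intended effect of extrapolation.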


Hume: Introducing System-2 Thinking in Visual-Language-Action Model

Abstract

arXiv:2505.21432v1 Announce Type: cross Abstract: Humans practice slow thinking before performing actual actions when handling complex tasks in the physical world. This thinking paradigm has recently achieved remarkable advancement in boosting Large Language Models (LLMs) to solve complex tasks in digital domains. However, the potential of slow thinking remains largely unexplored for robotic foundation models interacting with the physical world. In this work, we propose Hume: a dual-system Vision-Language-Action (VLA) model with value-guided System-2 thinking and cascaded action denoising, exploring human-like thinking capabilities of Vision-Language-Action models for dexterous robot control. System 2 of Hume implements value-guided thinking by extending a Vision-Language-Action model backbone with a novel value-query head to estimate the state-action value of predicted actions. The value-guided thinking is conducted by repeatedly sampling multiple action candidates and selecting one according to its state-action value. System 1 of Hume is a lightweight reactive visuomotor policy that takes the action selected by System 2 and performs cascaded action denoising for dexterous robot control. At deployment time, System 2 performs value-guided thinking at a low frequency while System 1 asynchronously receives the action candidate selected by System 2 and predicts fluid actions in real time. We show that Hume outperforms existing state-of-the-art Vision-Language-Action models across multiple simulation benchmarks and real-robot deployments.

摘要

人类在处理物理世界中的复杂任务时,会在实际行动前进行慢思考。这种思维范式最近在提升大语言模型(LLMs)解决数字领域复杂任务方面取得了显著进展。然而,对于与物理世界交互的机器人基础模型而言,慢思考的潜力仍基本未被探索。本研究提出Hume:一种具有价值引导System-2思维和级联动作去噪的双系统视觉-语言-动作(VLA)模型,旨在探索视觉-语言-动作模型在灵巧机器人控制中的人类式思维能力。Hume的System 2通过为视觉-语言-动作模型主干添加新型价值查询头来估计预测动作的状态-动作值,从而实现价值引导思维。该思维过程通过重复采样多个候选动作并根据状态-动作值进行选择来实现。Hume的System 1是轻量级反应式视觉运动策略,负责接收System 2选定的候选动作并执行级联动作去噪以实现灵巧机器人控制。在部署时,System 2以低频进行价值引导思维,而System 1异步接收System 2选择的候选动作并实时预测流畅动作。实验表明,Hume在多个模拟基准测试和真实机器人部署中均优于现有最先进的视觉-语言-动作模型。
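
The value-guided selection step (sample several action candidates, score each with a state-action value estimate, keep the best) can be sketched as below. This is a toy illustration, not the Hume implementation: `sample_action` is a hypothetical Gaussian stand-in for the VLA policy, and `q_value` is a hand-written stand-in for the value-query head.

```python
import random


def sample_action(rng, dim=2):
    """Hypothetical policy stand-in: draw an action from a unit Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def q_value(state, action):
    """Hypothetical value head: actions closer to the state vector score higher."""
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def value_guided_select(state, num_candidates, rng):
    """Sample candidates, score each one, and return the highest-valued action."""
    candidates = [sample_action(rng, len(state)) for _ in range(num_candidates)]
    values = [q_value(state, a) for a in candidates]
    best = max(range(num_candidates), key=values.__getitem__)
    return candidates[best], values[best], candidates, values


rng = random.Random(0)
action, value, candidates, values = value_guided_select([0.5, -0.5], 8, rng)
print("selected action:", [round(a, 3) for a in action], "value:", round(value, 3))
```

In the dual-system design described in the abstract, the selected candidate would then be handed to the faster System 1 policy for denoising; that stage is not modeled here.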


Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making

Abstract

arXiv:2505.21503v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention articulated to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.

摘要

大语言模型(LLMs)在临床问答中展现出强大潜力,近期多智能体框架通过协作推理进一步提升了诊断准确性。然而,我们发现存在"沉默共识"的反复出现问题,即智能体在缺乏充分批判性分析的情况下过早达成诊断结论,尤其在复杂或模糊病例中。我们提出"鲶鱼智能体"这一新概念,该角色专用LLM旨在注入结构化异议以对抗沉默共识。受组织心理学中"鲶鱼效应"启发,鲶鱼智能体通过挑战既有共识来激发深度推理。我们构建了两种机制以实现有效且情境感知的干预:(1)基于病例难度调节智能体参与度的复杂度感知干预;(2)平衡批评与协作的语气校准干预。在九个医学问答和三个医学视觉问答基准测试上的评估表明,我们的方法始终优于单智能体和多智能体LLMs框架,包括GPT-4o和DeepSeek-R1等领先商业模型。


How does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective

Abstract

arXiv:2505.21505v1 Announce Type: cross Abstract: Multilingual alignment is an effective and representative paradigm for enhancing LLMs' multilingual capabilities; it transfers capabilities from high-resource languages to low-resource languages. Meanwhile, studies on language-specific neurons reveal that some neurons are selectively activated in LLMs when processing different languages. This provides a new perspective for analyzing and understanding LLMs' mechanisms more specifically in multilingual scenarios. In this work, we propose a new finer-grained neuron identification algorithm, which detects language neurons (including language-specific neurons and language-related neurons) and language-agnostic neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights for better understanding multilingual alignment and multilingual capabilities of LLMs.

摘要

多语言对齐是增强大语言模型多语言能力的有效代表性范式,其将高资源语言的能力迁移至低资源语言。同时,特定语言神经元的相关研究表明,大语言模型在处理不同语言时会选择性激活具有语言特异性的神经元。这为更具体地分析和理解大语言模型在多语言场景下的工作机制提供了新视角。本研究提出了一种新的细粒度神经元识别算法,可检测语言神经元(包括语言特异性神经元和语言相关神经元)及语言无关神经元。基于不同类型神经元的分布特征,我们将大语言模型的多语言推理内部过程划分为四个部分:(1)多语言理解;(2)共享语义空间推理;(3)多语言输出空间转换;(4)词汇空间输出。此外,我们系统分析了模型在对齐前后各类神经元的变化,并研究了"自发性多语言对齐"现象。总体而言,本研究基于不同类型神经元展开了全面探究,为深入理解大语言模型的多语言对齐机制及多语言能力提供了实证结果与有价值的见解。


"Oh LLM, I'm Asking Thee, Please Give Me a Decision Tree": Zero-Shot Decision Tree Induction and Embedding with Large Language Models

Abstract

arXiv:2409.18594v2 Announce Type: replace Abstract: Large language models (LLMs) provide powerful means to leverage prior knowledge for predictive modeling when data is limited. In this work, we demonstrate how LLMs can use their compressed world knowledge to generate intrinsically interpretable machine learning models, i.e., decision trees, without any training data. We find that these zero-shot decision trees can even surpass data-driven trees on some small-sized tabular datasets and that embeddings derived from these trees perform better than data-driven tree-based embeddings on average. Our decision tree induction and embedding approaches can therefore serve as new knowledge-driven baselines for data-driven machine learning methods in the low-data regime. Furthermore, they offer ways to harness the rich world knowledge within LLMs for tabular machine learning tasks. Our code and results are available at https://github.com/ml-lab-htw/llm-trees.

摘要

当数据有限时,大语言模型(LLMs)为利用先验知识进行预测建模提供了强大工具。本研究展示了LLMs如何利用其压缩的世界知识生成本质上可解释的机器学习模型(即决策树),而无需任何训练数据。我们发现,在某些小规模表格数据集上,这些零样本决策树甚至能超越数据驱动的决策树,且基于这些树生成的嵌入表示平均优于数据驱动的树嵌入方法。因此,我们的决策树归纳和嵌入方法可作为低数据环境下数据驱动机器学习方法的新知识驱动基线。此外,这些方法为利用LLMs中丰富的世界知识处理表格机器学习任务提供了新途径。代码与结果详见https://github.com/ml-lab-htw/llm-trees。
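
To make the notions of a zero-shot tree and a tree-derived embedding concrete, here is a minimal sketch. The tree, its features, and its thresholds are invented for illustration (in the paper's setting an LLM would write them from prior knowledge), and the embedding shown, one truth value per inner-node test, is an assumed design rather than a confirmed description of the authors' method.

```python
# A hypothetical hand-written tree of the kind an LLM might produce without data.
TREE = {
    "feature": "age", "threshold": 50,
    "left": {"label": 0},
    "right": {
        "feature": "bmi", "threshold": 30,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}


def predict(tree, sample):
    """Walk the tree: go left when sample[feature] <= threshold, else right."""
    node = tree
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def embed(tree, sample):
    """Pre-order list of 0/1 truth values, one per inner-node test in the tree."""
    if "label" in tree:
        return []
    here = 1 if sample[tree["feature"]] <= tree["threshold"] else 0
    return [here] + embed(tree["left"], sample) + embed(tree["right"], sample)


patient = {"age": 62, "bmi": 33}
print(predict(TREE, patient))  # 1
print(embed(TREE, patient))    # [0, 0]
```

The embedding has a fixed length equal to the number of inner nodes, so it can be fed to any downstream tabular model.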


DCA-Bench: A Benchmark for Dataset Curation Agents

Abstract

arXiv:2406.07275v2 Announce Type: replace Abstract: The quality of datasets plays an increasingly crucial role in the research and development of modern artificial intelligence (AI). Despite the proliferation of open dataset platforms nowadays, data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets. Furthermore, these issues are often subtle and difficult to detect with rule-based scripts, and therefore require identification and verification by dataset users or maintainers, a process that is both time-consuming and prone to human mistakes. With the surging ability of large language models (LLMs), it is promising to streamline the discovery of hidden dataset issues with LLM agents. To achieve this, one significant challenge is enabling LLM agents to detect issues in the wild rather than simply fixing known ones. In this work, we establish a benchmark to measure LLM agents' ability to tackle this challenge. We carefully curate 221 real-world test cases from eight popular dataset platforms and propose an automatic evaluation framework using GPT-4o. Our proposed framework shows strong empirical alignment with expert evaluations, validated through extensive comparisons with human annotations. Without any hints, the most competitive Curator agent can reveal only about 30% of the data quality issues in the proposed dataset, highlighting the complexity of this task and indicating that applying LLM agents to real-world dataset curation still requires further in-depth exploration and innovation. The data and code are available at https://github.com/TRAIS-Lab/dca-bench.

摘要

数据集质量在现代人工智能(AI)的研究与开发中扮演着日益关键的角色。尽管当前开放数据集平台数量激增,但文档缺失、标注错误、伦理问题及信息过时等数据质量问题在广泛使用的数据集中仍普遍存在。这些问题往往具有隐蔽性,难以通过基于规则的脚本检测,需要数据集使用者或维护者进行人工识别与验证——这一过程既耗时又容易出错。随着大语言模型(LLM)能力的飞速提升,利用LLM智能体系统性发现隐藏数据集问题具有广阔前景。实现该目标的核心挑战在于使LLM智能体能够主动发现未知问题,而非仅修复已知缺陷。本研究建立了一个基准测试来衡量LLM智能体应对该挑战的能力:我们从八个主流数据集平台精心筛选了221个真实测试案例,并提出了基于GPT-4o的自动化评估框架。实验表明,该框架与专家评估结果具有强一致性,这一结论通过大量人工标注对比得到验证。在没有提示的情况下,当前最优的Curator智能体仅能发现约30%的数据质量问题,既揭示了该任务的复杂性,也表明LLM智能体在实际数据集治理中的应用仍需深入探索与创新。数据与代码已开源于https://github.com/TRAIS-Lab/dca-bench。


A Generation Framework with Strict Constraints for Crystal Materials Design

Abstract

arXiv:2411.08464v2 Announce Type: replace Abstract: The design of crystal materials plays a critical role in areas such as new energy development, biomedical engineering, and semiconductors. Recent advances in data-driven methods have enabled the generation of diverse crystal structures. However, most existing approaches still rely on random sampling without strict constraints, requiring multiple post-processing steps to identify stable candidates with the desired physical and chemical properties. In this work, we present a new constrained generation framework that takes multiple constraints as input and enables the generation of crystal structures with specific chemical compositions and properties. In this framework, intermediate constraints, such as symmetry information and composition ratio, are generated by a constraint generator based on large language models (LLMs), which considers the target properties. These constraints are then used by a subsequent crystal structure generator to ensure that the structure generation process is under control. Our method generates crystal structures with a probability of meeting the target properties that is more than twice that of existing approaches. Furthermore, nearly 100% of the generated crystals strictly adhere to the predefined chemical composition, eliminating supply chain risks during production.

摘要

晶体材料的设计在新能源开发、生物医学工程和半导体等领域具有关键作用。近年来数据驱动方法的进展使得多样化的晶体结构生成成为可能。然而现有方法大多仍依赖无严格约束的随机采样,需要通过多重后处理步骤才能筛选出具有目标物化特性的稳定候选结构。本研究提出了一种新型约束生成框架,该框架以多重约束条件作为输入,能够生成具有特定化学组成和物性的晶体结构。该框架中,对称性信息和组成比例等中间约束条件由基于大语言模型(LLMs)的约束生成器产生,该生成器会综合考虑目标特性。这些约束条件随后被后续的晶体结构生成器采用,以确保结构生成过程处于可控状态。本方法生成符合目标特性晶体结构的概率是现有方法的两倍以上。此外,近100%的生成晶体严格遵循预设化学组成,彻底消除了生产过程中的供应链风险。


TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

Abstract

arXiv:2410.10479v2 Announce Type: replace Abstract: The rapid advancement of large language models has accelerated their application in reasoning, with strategic reasoning drawing increasing attention. To evaluate the strategic reasoning capabilities of LLMs, game theory, with its concise structure, has become the preferred approach for many researchers. However, current research typically focuses on a limited selection of games, resulting in low coverage of game types. Additionally, classic game scenarios carry risks of data leakage, and the benchmarks used often lack extensibility, rendering them inadequate for evaluating state-of-the-art models. To address these challenges, we propose TMGBench, characterized by comprehensive game type coverage, diverse scenarios, and flexible game organization. Specifically, we incorporate all 144 game types summarized by the Robinson-Goforth topology of 2x2 games, constructed as classic games in our benchmark; we also synthesize diverse, higher-quality game scenarios for each classic game, which we refer to as story-based games. Lastly, to provide a sustainable evaluation framework adaptable to increasingly powerful LLMs, we treat the aforementioned games as atomic units and organize them into more complex forms through sequential, parallel, and nested structures. We conducted a comprehensive evaluation of mainstream LLMs, covering tests on rational reasoning, reasoning robustness, Theory-of-Mind capabilities, and reasoning in complex game forms. The results revealed that LLMs still have flaws in the accuracy and consistency of strategic reasoning processes, and their levels of mastery over Theory-of-Mind also vary. Additionally, SOTA models such as o3-mini, Qwen3, and deepseek-reasoner were evaluated across the sequential, parallel, and nested game structures, and the results highlighted the challenges posed by TMGBench.

摘要

大型语言模型的快速发展加速了其在推理领域的应用,其中策略推理日益受到关注。为评估大语言模型的策略推理能力,具有简洁结构的博弈论已成为众多研究者的首选方法。然而当前研究通常局限于少量博弈类型,导致游戏类型覆盖率较低。此外,经典博弈场景存在数据泄露风险,且现有基准往往缺乏可扩展性,难以充分评估最先进的模型。针对这些挑战,我们提出TMGBench基准,其特点在于全面的博弈类型覆盖、多样化场景和灵活的游戏组织形式。具体而言,我们整合了Robinson-Goforth拓扑归纳的144种2x2博弈类型,构建为基准中的经典博弈;同时为每个经典博弈合成多样化、更高质量的情境化场景(即故事型博弈)。最后,为提供适应持续增强的大语言模型的可持续评估框架,我们将上述博弈视为原子单元,通过串行、并行和嵌套结构组织成更复杂的形式。我们对主流大语言模型进行了全面评估,涵盖理性推理、推理鲁棒性、心智理论能力及复杂博弈形式下的推理测试。结果表明大语言模型在策略推理过程的准确性和一致性方面仍存在缺陷,其心智理论掌握程度也参差不齐。此外,我们对o3-mini、Qwen3和deepseek-reasoner等前沿模型在串行、并行和嵌套博弈结构中的表现进行了评估,结果凸显了TMGBench基准提出的挑战。
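
The figure of 144 game types can be reproduced with a short enumeration. The sketch below is not taken from the TMGBench code. It assumes the usual reading of the Robinson-Goforth count: each player ranks the four outcomes strictly (payoffs 1 to 4), and two games are treated as the same type when they differ only by swapping the two rows, the two columns, or both. The function names are hypothetical.

```python
from itertools import permutations


def canonical(game):
    """Return one representative of a 2x2 game under row/column swaps.

    `game` holds four (row_payoff, col_payoff) cells in the order
    (top-left, top-right, bottom-left, bottom-right).
    """
    tl, tr, bl, br = game
    variants = [
        (tl, tr, bl, br),  # identity
        (bl, br, tl, tr),  # swap the two rows
        (tr, tl, br, bl),  # swap the two columns
        (br, bl, tr, tl),  # swap both
    ]
    return min(variants)


def count_strict_ordinal_games():
    """Count strict ordinal 2x2 games up to row and column relabeling."""
    ranks = (1, 2, 3, 4)
    seen = set()
    for row_payoffs in permutations(ranks):
        for col_payoffs in permutations(ranks):
            game = tuple(zip(row_payoffs, col_payoffs))
            seen.add(canonical(game))
    return len(seen)


print(count_strict_ordinal_games())  # 24 * 24 / 4 = 144
```

There are 24 x 24 = 576 ways to assign the two strict rankings, and each game has exactly four distinct relabelings, so the enumeration yields 576 / 4 = 144 types.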


Path Pooling: Training-Free Structure Enhancement for Efficient Knowledge Graph Retrieval-Augmented Generation

Abstract

arXiv:2503.05203v2 Announce Type: replace Abstract: Although Large Language Models achieve strong success in many tasks, they still suffer from hallucinations and knowledge deficiencies in real-world applications. Many knowledge graph-based retrieval-augmented generation (KG-RAG) methods enhance the quality and credibility of LLMs by leveraging structure and semantic information in KGs as external knowledge bases. However, these methods struggle to effectively incorporate structure information, either incurring high computational costs or underutilizing available knowledge. Inspired by smoothing operations in graph representation learning, we propose path pooling, a simple, training-free strategy that introduces structure information through a novel path-centric pooling operation. It seamlessly integrates into existing KG-RAG methods in a plug-and-play manner, enabling richer structure information utilization. Extensive experiments demonstrate that incorporating the path pooling into the state-of-the-art KG-RAG method consistently improves performance across various settings while introducing negligible additional cost.

摘要

尽管大型语言模型在许多任务中取得了显著成功,但在实际应用中仍存在幻觉和知识缺陷问题。许多基于知识图谱的检索增强生成(KG-RAG)方法通过利用知识图谱中的结构和语义信息作为外部知识库,提升了语言模型输出的质量与可信度。然而,这些方法难以有效整合结构信息,要么导致高昂的计算成本,要么未能充分利用现有知识。受图表示学习中平滑操作的启发,我们提出路径池化策略——这是一种无需训练的简单方法,通过新颖的以路径为中心的池化操作引入结构信息。该策略能以即插即用方式无缝集成到现有KG-RAG方法中,实现更丰富的结构信息利用。大量实验表明,将路径池化融入最先进的KG-RAG方法后,能在引入可忽略额外成本的同时,持续提升各类场景下的性能表现。


Leveraging Large Language Models for Active Merchant Non-player Characters

Abstract

arXiv:2412.11189v3 Announce Type: replace Abstract: We highlight two significant issues leading to the passivity of current merchant non-player characters (NPCs): pricing and communication. While immersive interactions with active NPCs have been a focus, price negotiations between merchant NPCs and players remain underexplored. First, passive pricing refers to the limited ability of merchants to modify predefined item prices. Second, passive communication means that merchants can only interact with players in a scripted manner. To tackle these issues and create an active merchant NPC, we propose a merchant framework based on large language models (LLMs), called MART, which consists of an appraiser module and a negotiator module. We conducted two experiments to explore various implementation options under different training methods and LLM sizes, considering a range of possible game environments. Our findings indicate that finetuning methods, such as supervised finetuning (SFT) and knowledge distillation (KD), are effective in using smaller LLMs to implement active merchant NPCs. Additionally, we found three irregular cases arising from the responses of LLMs.

摘要

我们指出导致当前商人类非玩家角色(NPC)被动性的两个关键问题:定价与交互机制。尽管沉浸式主动NPC交互一直是研究重点,但商人NPC与玩家之间的价格谈判机制仍缺乏深入探索。首先,被动定价表现为商人修改预设物品价格的能力受限;其次,被动交互意味着商人仅能通过脚本化方式与玩家互动。为解决这些问题并创建主动型商人NPC,我们提出基于大语言模型(LLM)的商人框架MART,其由评估模块和谈判模块构成。通过两项实验,我们在不同训练方法和LLM规模下探索了多种实现方案,并考虑了各类潜在游戏环境。研究发现,监督微调(SFT)和知识蒸馏(KD)等微调方法能有效利用小型LLM实现主动型商人NPC。此外,我们还发现了LLM响应产生的三种异常案例。


Exploring the Necessity of Reasoning in LLM-based Agent Scenarios

Abstract

arXiv:2503.11074v2 Announce Type: replace Abstract: The rise of Large Reasoning Models (LRMs) signifies a paradigm shift toward advanced computational reasoning. Yet, this progress disrupts traditional agent frameworks, which have long been anchored by execution-oriented Large Language Models (LLMs). To explore this transformation, we propose the LaRMA framework, encompassing nine tasks across Tool Usage, Plan Design, and Problem Solving, assessed with three top LLMs (e.g., Claude3.5-sonnet) and five leading LRMs (e.g., DeepSeek-R1). Our findings address four research questions: LRMs surpass LLMs in reasoning-intensive tasks like Plan Design, leveraging iterative reflection for superior outcomes; LLMs excel in execution-driven tasks such as Tool Usage, prioritizing efficiency; hybrid LLM-LRM configurations, pairing LLMs as actors with LRMs as reflectors, optimize agent performance by blending execution speed with reasoning depth; and LRMs' enhanced reasoning incurs higher computational costs, prolonged processing, and behavioral challenges, including overthinking and fact-ignoring tendencies. This study fosters deeper inquiry into LRMs' balance of deep thinking and overthinking, laying a critical foundation for future agent design advancements.

摘要

大型推理模型(LRMs)的崛起标志着计算推理领域向高阶范式转变。然而,这一进展打破了传统以执行为导向的大语言模型(LLMs)为核心的智能体框架。为探究此变革,我们提出LaRMA框架,涵盖工具使用、方案设计和问题解决三大类共九项任务,并评估三款顶尖LLM(如Claude3.5-sonnet)与五款领先LRM(如DeepSeek-R1)的表现。研究发现:LRMs在方案设计等推理密集型任务中凭借迭代反思机制优于LLMs;LLMs在工具使用等执行驱动型任务中效率更优;LLM-LRM混合架构(以LLMs为执行者、LRMs为反思者)能融合执行速度与推理深度,优化智能体性能;LRMs的增强推理能力伴随更高计算成本、更长处理时间及行为挑战(包括过度思考与忽视事实倾向)。本研究为深入探索LRMs深度思考与过度思考的平衡机制奠定基础,对未来智能体设计发展具有重要意义。


ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation

Abstract

arXiv:2501.06598v2 Announce Type: replace Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding tasks. However, interpreting charts with textual descriptions often leads to information loss, as it fails to fully capture the dense information embedded in charts. In contrast, parsing charts into code provides lossless representations that can effectively contain all critical details. Although existing open-source MLLMs have achieved success in chart understanding tasks, they still face two major challenges when applied to chart-to-code tasks: (1) low executability and poor restoration of chart details in the generated code and (2) lack of large-scale and diverse training data. To address these challenges, we propose ChartCoder, the first dedicated chart-to-code MLLM, which leverages Code LLMs as the language backbone to enhance the executability of the generated code. Furthermore, we introduce Chart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation, and propose the Snippet-of-Thought (SoT) method, which transforms direct chart-to-code generation data into step-by-step generation. Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks, achieving superior chart restoration and code executability. Our code is available at https://github.com/thunlp/ChartCoder.

摘要

多模态大语言模型(MLLMs)在图表理解任务中展现出卓越能力。然而,通过文本描述解释图表通常会导致信息丢失,因其无法完整捕捉图表中嵌入的密集信息。相比之下,将图表解析为代码可提供无损表示,从而有效包含所有关键细节。尽管现有开源MLLMs在图表理解任务中已取得成功,但在应用于图表到代码任务时仍面临两大挑战:(1)生成代码的可执行性低且图表细节还原度差;(2)缺乏大规模多样化训练数据。为解决这些问题,我们提出首个专用于图表到代码任务的MLLM模型ChartCoder,其采用代码大语言模型作为语言主干以增强生成代码的可执行性。此外,我们构建首个大规模多样化图表到代码生成数据集Chart2Code-160k,并提出思维片段(SoT)方法,将直接图表到代码的生成数据转化为分步生成。实验表明,仅含70亿参数的ChartCoder在图表到代码基准测试中超越现有开源MLLMs,实现了更优的图表还原与代码可执行性。代码已发布于https://github.com/thunlp/ChartCoder。


Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

Abstract

arXiv:2504.09762v2 Announce Type: replace Abstract: Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thoughts" -- implicitly anthropomorphizing the model, implying these tokens resemble steps a human might take when solving a challenging problem. In this paper, we present evidence that this anthropomorphization isn't a harmless metaphor, and instead is quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research.

摘要

中间标记生成(ITG)作为一种提升语言模型在推理任务上表现的方法被提出,其核心思想是模型在输出最终解决方案前先生成中间标记。这些中间标记常被称为"推理轨迹"或"思维"——这种表述隐含着对模型的人格化隐喻,暗示这些标记类似于人类解决复杂问题时的思考步骤。本文通过实证表明,这种人格化并非无害的比喻,反而具有相当危险性:它混淆了这些模型的本质及有效使用方法,并导致了值得商榷的研究取向。


Evaluating LLM Adaptation to Sociodemographic Factors: User Profile vs. Dialogue History

Abstract

arXiv:2505.21362v1 Announce Type: cross Abstract: Effective engagement by large language models (LLMs) requires adapting responses to users' sociodemographic characteristics, such as age, occupation, and education level. While many real-world applications leverage dialogue history for contextualization, existing evaluations of LLMs' behavioral adaptation often focus on single-turn prompts. In this paper, we propose a framework to evaluate LLM adaptation when attributes are introduced either (1) explicitly via user profiles in the prompt or (2) implicitly through multi-turn dialogue history. We assess the consistency of model behavior across these modalities. Using a multi-agent pipeline, we construct a synthetic dataset pairing dialogue histories with distinct user profiles and employ questions from the Value Survey Module (VSM 2013) (Hofstede and Hofstede, 2016) to probe value expression. Our findings indicate that most models adjust their expressed values in response to demographic changes, particularly in age and education level, but consistency varies. Models with stronger reasoning capabilities demonstrate greater alignment, indicating the importance of reasoning in robust sociodemographic adaptation.

摘要

大型语言模型(LLM)要实现有效互动,需根据用户的社会人口特征(如年龄、职业和教育水平)调整响应。虽然许多实际应用利用对话历史实现情境化,但现有对LLM行为适应的评估往往集中于单轮提示。本文提出一个评估框架,用于测试当社会属性通过(1)提示中的用户配置文件显式引入,或(2)多轮对话历史隐式引入时LLM的适应能力。我们评估模型在这些模态间的行为一致性。通过多智能体流程,我们构建了一个合成数据集,将对话历史与不同用户配置文件配对,并采用《价值观调查模块》(VSM 2013)(Hofstede和Hofstede,2016)中的问题来探测价值表达。研究发现,大多数模型会根据人口统计变化(尤其是年龄和教育水平)调整其表达的价值观,但一致性存在差异。具有更强推理能力的模型表现出更高的对齐性,这表明推理能力对于实现稳健的社会人口适应至关重要。


ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning

Abstract

arXiv:2503.09501v3 Announce Type: replace Abstract: Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Additionally, we further extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs. Our code can be found in https://github.com/ziyuwan/ReMA-public

摘要

近期关于大语言模型(LLMs)推理能力的研究试图通过整合元思维(meta-thinking)来进一步提升其性能——即让模型能够监控、评估并控制自身的推理过程,以实现更自适应且高效的问题解决。然而,当前的单智能体研究缺乏针对元思维获取的专门设计,导致效果不佳。为解决这一挑战,我们提出了强化元思维智能体(ReMA)这一新颖框架,该框架利用多智能体强化学习(MARL)来激发元思维行为,促使LLMs进行"关于思考的思考"。ReMA将推理过程解耦为两个层次化智能体:负责生成战略监督与规划的高层元思维智能体,以及执行具体细节推理的低层推理智能体。通过目标对齐的迭代强化学习,这些智能体探索并学会协作,从而提升泛化能力和鲁棒性。单轮实验的实证结果表明,在复杂推理任务(包括竞赛级数学基准和LLM-as-a-Judge基准)上,ReMA的表现优于单智能体强化学习基线方法。此外,我们进一步将ReMA扩展至多轮交互场景,利用轮次比例和参数共享提升效率。全面的消融研究进一步揭示了各智能体的动态演化过程,为理解元思维推理如何增强LLMs的推理能力提供了宝贵见解。代码详见:https://github.com/ziyuwan/ReMA-public


WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Abstract

arXiv:2306.08568v2 Announce Type: replace-cross Abstract: Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM

摘要

代码大语言模型(Code LLMs),如StarCoder,在代码相关任务中展现出卓越性能。然而,现有模型大多仅基于海量原始代码数据进行预训练,而未经过指令微调。本文提出WizardCoder,通过将Evol-Instruct方法适配至代码领域,实现了对Code LLMs的复杂指令微调。在四个主流代码生成基准测试(HumanEval、HumanEval+、MBPP和DS-1000)上的综合实验表明,我们的模型具有非凡能力:其性能显著超越所有其他开源Code LLMs。此外,在HumanEval和HumanEval+测试中,我们的模型甚至优于最大规模的闭源LLMs(Anthropic的Claude和Google的Bard)。代码、模型权重及数据已开源:https://github.com/nlpxucan/WizardLM


WizardLM: Empowering large pre-trained language models to follow complex instructions

Abstract

arXiv:2304.12244v3 Announce Type: replace-cross Abstract: Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM

摘要

利用开放域指令跟随数据训练大型语言模型(LLM)已取得巨大成功。然而,人工创建此类指令数据耗时费力,且人类难以生成高复杂度指令。本文提出一种利用LLM而非人工生成不同复杂度大规模指令数据的方法。基于初始指令集,我们采用提出的Evol-Instruct技术逐步将其重写为更复杂的指令,随后混合所有生成数据对LLaMA进行微调,所得模型命名为WizardLM。在复杂度平衡测试集和Vicuna测试集上的人工评估表明,Evol-Instruct生成的指令优于人工创建指令。通过分析高复杂度部分的人工评估结果,我们发现WizardLM的输出优于OpenAI ChatGPT的输出。GPT-4自动评估显示,WizardLM在29项技能中有17项达到ChatGPT 90%以上的能力。尽管WizardLM在某些方面仍落后于ChatGPT,但研究表明基于AI进化指令的微调是增强LLM的有效方向。代码与数据已开源:https://github.com/nlpxucan/WizardLM


Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods

Abstract

arXiv:2401.16332v5 Announce Type: replace-cross Abstract: Language model alignment has become an important component of AI safety, allowing safe interactions between humans and language models, by enhancing desired behaviors and inhibiting undesired ones. It is often done by tuning the model or inserting preset aligning prompts. Recently, representation engineering, a method which alters the model's behavior via changing its representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains in alignment oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to cause a decrease in the ability of the model to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and decrease in helpfulness of the model. We propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. First, we find that under the conditions of our framework, alignment can be guaranteed with representation engineering, and at the same time that helpfulness is harmed in the process. Second, we show that helpfulness is harmed quadratically with the norm of the representation engineering vector, while the alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. We validate our findings empirically, and chart the boundaries to the usefulness of representation engineering for alignment.

摘要

语言模型对齐已成为人工智能安全的重要组成部分,通过增强期望行为并抑制非期望行为,实现人类与语言模型的安全交互。现有方法通常通过模型调优或预设对齐提示实现。近期研究表明,表征工程(一种通过训练后改变模型表征来调整其行为的方法)能有效对齐大语言模型(Zou等,2023a)。该方法在抵抗对抗性攻击和减少社会偏见等对齐任务中表现优异,但也被发现会降低模型执行基础任务的能力。本文研究了模型对齐性提升与实用性下降之间的权衡关系,提出了量化这两项指标的理论框架,并通过实验验证其适用性。首先,我们发现框架条件下表征工程可确保对齐性,但同时会损害实用性;其次,我们证明实用性损害与表征工程向量的范数呈二次方关系,而对齐性提升与之呈线性关系,这揭示了表征工程的高效应用区间。我们通过实验验证了这些发现,并界定了表征工程在对齐任务中的有效适用范围。
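
The linear-versus-quadratic scaling in the abstract implies that small steering vectors are the efficient regime. The toy model below is only an illustration of that argument, not the paper's bounds: the coefficients `a` and `b` are hypothetical, and comparing gain and loss on a single scale is an added assumption.

```python
def alignment_gain(norm, a):
    """Toy linear model of the alignment gain: a * norm."""
    return a * norm


def helpfulness_loss(norm, b):
    """Toy quadratic model of the helpfulness loss: b * norm ** 2."""
    return b * norm ** 2


def break_even_norm(a, b):
    """Norm at which a * r equals b * r ** 2 (for r > 0), i.e. r = a / b."""
    return a / b


a, b = 1.0, 0.5  # hypothetical coefficients, chosen only for illustration
for r in (0.5, 1.0, 2.0, 3.0):
    print(r, alignment_gain(r, a), helpfulness_loss(r, b))
```

Under these assumptions the gain exceeds the loss for every norm between 0 and `a / b`, and the loss dominates beyond that point.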


Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Abstract

arXiv:2403.05518v2 Announce Type: replace-cross Abstract: Chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning. But CoT can also systematically misrepresent the factors influencing models' behavior -- for example, rationalizing answers in line with a user's opinion. We first create a new dataset of 9 different biases that affect GPT-3.5-Turbo and Llama-8b models. These consist of spurious-few-shot patterns, post hoc rationalization, and sycophantic settings. Models switch to the answer implied by the bias, without mentioning the effect of the bias in the CoT. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where ground truth reasoning is unavailable.

摘要

思维链提示(CoT)具有提升语言模型推理可解释性的潜力。但CoT也可能系统性歪曲影响模型行为的因素——例如生成符合用户观点的答案合理化。我们首先构建了包含9种影响GPT-3.5-Turbo和Llama-8b模型偏见的新数据集,涵盖虚假少样本模式、事后合理化及迎合性场景等类型。模型会转向偏见暗示的答案,且未在思维链中提及偏见影响。为缓解这种偏见推理问题,我们提出偏见增强一致性训练(BCT),这是一种无监督微调方案,通过训练模型在含/不含偏见特征的提示中保持一致性推理。我们在七项问答任务上构建了测试九种偏见推理形式的评估体系,发现对GPT-3.5-Turbo应用单偏见BCT可使保留任务的偏见推理率降低86%。此外,该模型能泛化至其他偏见形式,在保留偏见上平均减少37%的偏见推理。由于BCT具有对未知偏见的泛化能力且无需标注数据,该方法有望减少尚未发现偏见及无真实推理标注任务中的偏见推理。


T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning

Abstract

arXiv:2404.08985v2 Announce Type: replace-cross Abstract: Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. Mixture-of-experts (MoE) provides a promising solution with a dynamic architecture, enabling effective task decoupling. However, scaling up the number of MoE experts incurs substantial parameter and computational overheads and suffers from limited performance gain due to naive routing mechanisms. In this paper, we design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX), which leverages the combination of ultra-low rank experts to construct LoRA weights on pretrained LLMs. The rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal efficiency. In addition, T-REX offers implicit guidance to the router, leveraging the inherent semantic clustering of training embeddings as prior knowledge, enabling optimized feature allocation across experts for a smoother convergence. Extensive theoretical and empirical results demonstrate that T-REX achieves superior efficiency and generalizability across diverse tasks. Compared with other LoRA-based methods, T-REX achieves up to 1.78% mean accuracy improvement with around 30%-40% fewer trainable parameters across 14 public datasets. Code is available at https://github.com/RoyZry98/T-REX-Pytorch.

摘要

大型语言模型(LLMs)在多样化多任务微调中面临显著的适应挑战。专家混合(MoE)通过动态架构提供了有前景的解决方案,能够实现有效的任务解耦。然而,增加MoE专家数量会导致参数量和计算开销大幅上升,且由于原始路由机制的限制,性能提升有限。本文提出了一种新颖框架——基于秩1专家混合的LoRA权重构建方法(T-REX),该框架利用超低秩专家的组合在预训练LLMs上构建LoRA权重。秩1专家通过混合匹配机制,以线性参数开销实现专家向量子空间的二次扩展,从而达到最优效率下的近似误差缩减。此外,T-REX通过利用训练嵌入的固有语义聚类作为先验知识,为路由器提供隐式指导,实现跨专家的优化特征分配,从而获得更平滑的收敛过程。大量理论与实证结果表明,T-REX在多样化任务中展现出卓越的效率和泛化能力。与其他基于LoRA的方法相比,T-REX在14个公开数据集上平均准确率最高提升1.78%,同时可训练参数减少约30%-40%。代码已开源。
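
One way to read the "mix-and-match" claim is that n column vectors and n row vectors can be paired into n x n rank-1 experts while storing only the 2n vectors. The sketch below illustrates that counting argument in plain Python. It is not the T-REX implementation: the pairing scheme, the dense routing weights, and all names are assumptions made for illustration.

```python
def mix_rank_one(us, vs, weights):
    """Return sum over (i, j) of weights[i][j] * outer(us[i], vs[j]).

    us: n vectors of length d_out; vs: n vectors of length d_in.
    Each pair (i, j) acts as one rank-1 expert, so n + n stored vectors
    give n * n experts.
    """
    d_out, d_in = len(us[0]), len(vs[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for i, u in enumerate(us):
        for j, v in enumerate(vs):
            w = weights[i][j]
            if w == 0.0:
                continue  # expert (i, j) is not selected
            for r in range(d_out):
                for c in range(d_in):
                    delta[r][c] += w * u[r] * v[c]
    return delta


def counts(n, d_out, d_in):
    """Stored parameters grow linearly in n, pairable experts quadratically."""
    return n * (d_out + d_in), n * n


us = [[1.0, 0.0], [0.0, 1.0]]
vs = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 2.0]]  # hypothetical router output
print(mix_rank_one(us, vs, weights))  # [[1.0, 2.0, 3.0], [0.0, 2.0, 0.0]]
print(counts(4, 8, 8))                # (64, 16)
```

Doubling n doubles the stored parameters but quadruples the number of rank-1 combinations, which is the linear-versus-quadratic relationship the abstract describes.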


An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment

Abstract

arXiv:2403.04963v3 Announce Type: replace-cross Abstract: Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs' simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models' performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.

摘要

近期研究采用自动指标与人工评估相结合的方式衡量大语言模型(LLM)的文本简化能力。然而现有评估方法对LLM的适用性仍存疑点:其一,当前自动指标应用于LLM简化评估的适宜性尚未明确;其二,现有句子简化的人工评估方法往往陷入两个极端——或过于流于表面而无法清晰反映模型性能,或过度细化导致标注流程复杂且易出现不一致性,进而影响评估可靠性。为解决这些问题,本研究在确保评估可靠性的同时深入剖析LLM的简化表现。我们设计了一套基于错误分析的人工标注框架来评估LLM的简化能力,选取了包括GPT-4、Qwen2.5-72B和Llama-3.2-3B在内的闭源与开源模型,这些模型在大、中、小规模LLM中具有代表性。实验表明,与当前最优技术相比,LLM生成的简化输出普遍错误更少,但仍存在局限性,如GPT-4和Qwen2.5-72B在词汇释义方面表现欠佳。此外,我们基于人工标注结果对常用自动指标进行元评估,发现这些指标对整体高质量简化(尤其是高性能LLM生成的简化文本)缺乏足够的敏感度。


Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing

Abstract

arXiv:2406.14230v4 Announce Type: replace-cross Abstract: Warning: Contains harmful model outputs. Despite significant advancements, the propensity of Large Language Models (LLMs) to generate harmful and unethical content poses critical challenges. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Although numerous benchmarks have been constructed to assess social bias, toxicity, and ethical issues in LLMs, those static benchmarks suffer from evaluation chronoeffect, in which, as models rapidly evolve, existing benchmarks may leak into training data or become saturated, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach based on adaptive testing methods in measurement theory. Unlike traditional adaptive testing methods that rely on a static test item pool, GETA probes the underlying moral boundaries of LLMs by dynamically generating test items tailored to model capability. GETA co-evolves with LLMs by learning a joint distribution of item difficulty and model value conformity, thus effectively addressing evaluation chronoeffect. We evaluated various popular LLMs with GETA and demonstrated that 1) GETA can dynamically create difficulty-tailored test items and 2) GETA's evaluation results are more consistent with models' performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.

摘要

尽管大型语言模型(LLMs)已取得显著进展,但其生成有害和不道德内容的倾向仍构成严峻挑战。衡量LLMs的价值对齐性对其监管和负责任部署至关重要。虽然目前已构建众多基准来评估LLMs的社会偏见、毒性和伦理问题,但这些静态基准存在评估时滞效应——随着模型快速迭代,现有基准可能泄露至训练数据或趋于饱和,从而高估持续发展的LLMs。为解决该问题,我们提出GETA:一种基于测量理论自适应测试方法的新型生成式演化测试框架。与传统依赖静态试题库的自适应测试不同,GETA通过动态生成适配模型能力的测试题目,探测LLMs潜在的道德边界。该方法通过学习题目难度与模型价值遵从度的联合分布,实现与LLMs的协同演化,从而有效解决评估时滞效应。我们对多种主流LLMs进行GETA评估,结果表明:1) GETA能动态生成难度适配的测试题目;2) GETA评估结果与模型在未见OOD及独立同分布题目上的表现更具一致性,为未来评估范式奠定基础。


Sentiment Reasoning for Healthcare

Abstract

arXiv:2407.21054v4 Announce Type: replace-cross Abstract: Transparency in AI healthcare decision-making is crucial. By incorporating rationales that explain the reason for each predicted label, users can understand the reasoning of Large Language Models (LLMs) and make better decisions. In this work, we introduce a new task, Sentiment Reasoning, for both speech and text modalities, together with our proposed multimodal multitask framework and the world's largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model both predicts the sentiment label and generates the rationale behind it based on the input transcript. Our study, conducted on both human transcripts and Automatic Speech Recognition (ASR) transcripts, shows that Sentiment Reasoning helps improve model transparency by providing rationales for model predictions with quality semantically comparable to humans, while also improving the model's classification performance (+2% in both accuracy and macro-F1) via rationale-augmented fine-tuning. In addition, we find no significant difference in the semantic quality of generated rationales between human and ASR transcripts. All code, data (five languages - Vietnamese, English, Chinese, German, and French) and models are published online: https://github.com/leduckhai/Sentiment-Reasoning

摘要

人工智能在医疗决策中的透明度至关重要。通过整合解释每个预测标签原因的理性依据,用户能够理解大语言模型(LLMs)的推理过程以做出更优决策。本研究提出了一项针对语音和文本模态的新任务——情感推理,并构建了多模态多任务框架及全球规模最大的多模态情感分析数据集。情感推理作为情感分析的辅助任务,要求模型根据输入文本预测情感标签并生成相应的推理依据。基于人工转录文本和自动语音识别(ASR)转录文本的实验表明:情感推理通过提供语义质量与人类相当的预测依据,不仅提升了模型透明度,还通过基于理性依据的微调使分类性能显著提升(准确率和宏观F1值均提高2%)。此外,人工转录与ASR转录生成依据的语义质量无显著差异。所有代码、数据(涵盖越南语、英语、汉语、德语和法语五种语言)及模型均已在线发布:https://github.com/leduckhai/Sentiment-Reasoning


Abstract

arXiv:2408.08105v4 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Can we effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these questions, we introduce MuCR - a novel Multimodal Causal Reasoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs' comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. Additionally, we find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose a VcCoT strategy that better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning. The project is available at: https://github.com/Zhiyuan-Li-John/MuCR

摘要

多模态大语言模型(MLLMs)在包括因果推理在内的复杂文本推理任务中展现出卓越的思维链(CoT)推理能力。然而,当关键线索隐藏于视觉细节时,这些因果关系是否仍能清晰呈现?若不能,哪些因素可能影响跨模态泛化?我们能否有效提升模型在文本与视觉双模态下的鲁棒因果推理能力?基于此,我们提出MuCR,这是一个创新的多模态因果推理基准,通过合成孪生图像-文本对来挑战MLLMs。同时,我们从图像级匹配、短语级理解和句子级解释等多维度开发定制化评估指标,全面衡量MLLMs的理解能力。实验表明,当前MLLMs在多模态因果推理上的表现显著落后于纯文本场景。此外,我们发现跨图像视觉线索的识别是实现有效跨模态泛化的关键。最后,我们提出VcCoT策略以强化视觉线索凸显,实验结果验证了该策略对增强多模态因果推理的有效性。项目地址:https://github.com/Zhiyuan-Li-John/MuCR


Can Large Language Models Understand Symbolic Graphics Programs?

Abstract

arXiv:2408.08313v4 Announce Type: replace-cross Abstract: Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to "imagine" and reason how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability, Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLMs' understanding of symbolic programs, but also improves general reasoning ability on various other benchmarks.

摘要

在大语言模型(LLMs)研究热潮的背景下,科学评估其能力与局限性的需求日益凸显。这一挑战的难点部分源于难以找到模型在训练中未接触过的测试任务。通过运用符号化图形程序,我们提出了一个适合全面测试LLMs空间语义推理能力的领域。这类在计算机图形学中广泛使用的程序化生成方法,能够按流程创建视觉数据。虽然LLMs在通用程序合成与分析方面展现出卓越能力,但符号化图形程序提供了新的评估维度:无需视觉编码器即可测试模型对图像或三维几何体语义问题的理解能力。要实现对符号程序的语义理解,LLMs必须具备仅凭局部曲率和笔触的符号描述就能'想象'并推理对应图形内容的能力。基于此,我们构建了一个大规模基准测试系统,用于评估LLMs对符号化图形程序的语义视觉理解能力,该系统通过程序化方式构建,极大减少了人工干预。研究重点聚焦于保持图像层级语义不变,同时对底层程序进行显著改动的变换操作。通过对商业和开源LLMs的基准测试,我们发现普遍具有更强推理能力的模型表现更优。最后,我们提出了一种创新性改进方法——符号指令微调(SIT),通过预收集的符号图形程序指令数据对LLM进行微调。值得注意的是,SIT不仅能提升LLMs对符号程序的理解能力,还能显著增强其在其他各类基准测试中的通用推理能力。


GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding

Abstract

arXiv:2409.04183v2 Announce Type: replace-cross Abstract: Programming languages possess rich semantic information - such as data flow - that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa - Graph Aligned Large Language Models. GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with seven different baseline LLMs ranging in size from 350M to 14B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3 and Qwen2.5-Coder.

摘要

编程语言蕴含丰富的语义信息(如数据流),这些信息通过图结构表示且无法从源代码表层形式直接获取。当前代码语言模型已扩展至数十亿参数规模,但仅将源代码视为文本符号进行处理,忽略了其他结构信息。而编码代码结构信息的模型则需对Transformer架构进行修改,限制了其规模及与预训练大语言模型的兼容性。本研究提出GALLa(图对齐大语言模型)框架,融合双方优势:通过图神经网络和跨模态对齐技术,在微调阶段以辅助任务形式将代码结构信息注入大语言模型。该框架兼具模型无关性与任务无关性——可应用于任意代码大语言模型处理各类下游任务,且仅需在训练时从与微调数据无关的语料库中获取结构图数据,推理阶段相较基线大语言模型不会产生额外成本。在五个代码任务上的实验(涉及七种基线大语言模型,规模从3.5亿到140亿参数)验证了GALLa的有效性:即使对于LLaMA3和Qwen2.5-Coder等强大模型,相较基线模型仍能实现稳定性能提升。


Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors

Abstract

arXiv:2410.13776v4 Announce Type: replace-cross Abstract: In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs). The knowledge acquired during pre-training is crucial for this few-shot capability, providing the model with task priors. However, recent studies have shown that ICL predominantly relies on retrieving task priors rather than "learning" to perform tasks. This limitation is particularly evident in complex subjective domains such as emotion and morality, where priors significantly influence posterior predictions. In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt. Moreover, we evaluate the posterior bias towards certain annotators by grounding our study in appropriate, quantitative measures of LLM priors. Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead. However, aggregation does not explain the entire gap between ICL and the state of the art, meaning other factors in such tasks also account for the observed phenomena. Finally, by rigorously studying annotator-level labels, we find that it is possible for minority annotators to both better align with LLMs and have their perspectives further amplified.

摘要

情境学习(ICL)已成为大型语言模型(LLM)处理自然语言任务的主要方法。预训练阶段获取的知识对于这种小样本学习能力至关重要,它为模型提供了任务先验。然而,近期研究表明ICL主要依赖任务先验的检索而非真正"学习"任务执行。这一局限在情感、道德等复杂主观领域尤为明显,因为先验会显著影响后验预测。本文通过实证分析发现,这种现象可能源于相关数据集采用的聚合方法——当试图合并低一致性、差异显著的标注时,可能产生提示模板中的有害标注噪声。我们基于定量化的LLM先验测量指标,进一步评估了模型对特定标注者的后验偏好。结果表明:聚合方法是主观任务建模中的混杂因素,建议转向个体层面的建模。但聚合方法并不能完全解释ICL与最优性能间的差距,说明此类任务中还存在其他影响因素。最后,通过对标注者层级标签的严格分析,我们发现少数派标注者既可能与LLM更对齐,也可能使其观点被进一步放大。


Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs

Abstract

arXiv:2410.14641v2 Announce Type: replace-cross Abstract: Positional bias in large language models (LLMs) hinders their ability to effectively process long inputs. A prominent example is the "lost in the middle" phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. Thorough experiments are conducted with five commercial and six open-source models. These experiments reveal that while most current models are robust against the "lost in the middle" issue, there exist significant biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases to advance LLMs' capabilities.

摘要

大语言模型(LLMs)中的位置偏差阻碍了其有效处理长输入的能力。一个突出的例子是"迷失在中间"现象,即LLMs难以利用位于输入中间位置的相关信息。尽管先前研究主要关注单一相关信息片段,但实际应用往往涉及多个相关信息片段。为填补这一空白,我们提出了LongPiBench基准测试,旨在评估涉及多个相关信息片段的位置偏差。我们对五种商业模型和六种开源模型进行了全面实验。这些实验表明,尽管当前大多数模型对"迷失在中间"问题具有鲁棒性,但仍存在与相关信息片段间距相关的显著偏差。这些发现凸显了评估和减少位置偏差对于提升LLM能力的重要性。


Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Abstract

arXiv:2410.01651v3 Announce Type: replace-cross Abstract: Despite the success of Transformers, handling long contexts remains challenging due to the limited length generalization and quadratic complexity of self-attention. Thus Transformers often require post-training with a larger attention window, significantly increasing computational and memory costs. In this paper, we propose a novel attention mechanism based on dynamic context, Grouped Cross Attention (GCA), which can generalize to 1000 times the pre-training context length while maintaining the ability to access distant information with a constant attention window size. For a given input sequence, we split it into chunks and use each chunk to retrieve top-k relevant past chunks for subsequent text generation. Specifically, unlike most previous works that use an off-the-shelf retriever, our key innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. Such a mechanism accommodates retrieved chunks with a fixed-size attention window to achieve long-range information access, significantly reducing computational and memory costs during training and inference. Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is 1000 times the training length.

摘要

尽管Transformer模型取得了成功,但由于长度泛化能力有限和自注意力机制的二次方复杂度,处理长上下文仍然具有挑战性。因此Transformer通常需要进行更大注意力窗口的后训练,这会显著增加计算和内存开销。本文提出了一种基于动态上下文的新型注意力机制——分组交叉注意力(GCA),该机制能够泛化到预训练上下文长度1000倍的场景,同时通过固定大小的注意力窗口保持访问远端信息的能力。对于给定的输入序列,我们将其分割为多个数据块,并利用每个数据块检索前k个最相关的历史数据块用于后续文本生成。与大多数现有工作使用现成检索器不同,我们的核心创新在于让检索器能够以端到端方式学习如何检索能更好最小化后续token自回归损失的历史数据块。这种机制通过固定大小的注意力窗口容纳检索到的数据块,实现长距离信息访问,显著降低了训练和推理时的计算与内存成本。实验表明,基于GCA的模型在1600万上下文长度的密码检索任务中实现了接近完美的准确率,这是训练长度的1000倍。
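
The chunk-and-retrieve step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' GCA implementation: the names `split_into_chunks` and `retrieve_past_chunks` are hypothetical, chunk embeddings are assumed to be given as plain lists of floats, and a fixed dot-product score stands in for the retriever that the paper trains end-to-end.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size tokens."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used here as a stand-in relevance score (an assumption)."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_past_chunks(
    chunk_embeddings: Sequence[Sequence[float]], current_index: int, k: int
) -> List[int]:
    """Return indices of the k most relevant chunks that precede the current one.

    Only chunks with index < current_index are candidates, which keeps the
    retrieval causal: a chunk never looks at text that comes after it.
    """
    query = chunk_embeddings[current_index]
    scored = [(dot(query, chunk_embeddings[j]), j) for j in range(current_index)]
    # Highest score first; ties are broken in favour of the earlier chunk.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:k]]
```

Because each chunk attends to only `k` retrieved chunks, the attention window stays constant no matter how long the input grows, which is the property the abstract relies on for length generalization.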


Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch

Abstract

arXiv:2410.18693v2 Announce Type: replace-cross Abstract: Improving the mathematical reasoning capabilities of Large Language Models (LLMs) is critical for advancing artificial intelligence. However, access to extensive, diverse, and high-quality reasoning datasets remains a significant challenge, particularly for the open-source community. In this paper, we propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method that enables the generation of large-scale mathematical reasoning datasets using lightweight 7B-scale models. ScaleQuest introduces a two-stage question-tuning process comprising Question Fine-Tuning (QFT) and Question Preference Optimization (QPO) to unlock the question generation capabilities of problem-solving models. By generating diverse questions from scratch, without relying on powerful proprietary models or seed data, we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform those trained on existing open-source datasets in both in-domain and out-of-domain evaluations. Furthermore, our approach shows continued performance improvement as the volume of training data increases, highlighting its potential for ongoing data scaling. The extensive improvements observed in code reasoning tasks demonstrate the generalization capabilities of our proposed method. Our work provides the open-source community with a practical solution to enhance the mathematical reasoning abilities of LLMs.

摘要

提升大型语言模型(LLMs)的数学推理能力对人工智能发展至关重要。然而,获取大规模、多样化且高质量的推理数据集仍面临重大挑战,尤其对开源社区而言。本文提出ScaleQuest——一种新颖、可扩展且经济高效的数据合成方法,该方法利用轻量级7B规模模型实现大规模数学推理数据集的生成。ScaleQuest采用包含问题微调(QFT)与问题偏好优化(QPO)的两阶段问题调优流程,从而释放解题模型的问题生成能力。通过完全从零生成多样化问题(不依赖强大专有模型或种子数据),我们构建了包含100万道问题-解决方案对的数据集。实验表明,基于本数据训练的模型在领域内与跨领域评估中均优于现有开源数据集。此外,该方法随着训练数据量增加持续展现性能提升,凸显其持续扩展数据的潜力。在代码推理任务中观察到的显著改进证实了本方法的泛化能力。本研究为开源社区提供了增强LLMs数学推理能力的实用解决方案。


EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Abstract

arXiv:2410.15332v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate "attention sink" effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8x improvements in Time-To-First-Token (TTFT) and 7x throughput gains over existing systems, with negligible or no accuracy loss.

摘要

大型语言模型(LLMs)在广泛的应用中展现出强大能力,但随着请求(提示)复杂度增加,其高效服务变得愈发具有挑战性。上下文缓存通过复用键值(KV)向量——即跨请求重复出现的令牌中间表示——来提升服务性能。然而现有上下文缓存技术要求请求间必须存在完全匹配的前缀,这限制了少样本学习和检索增强生成等场景中的复用机会,此类场景中不可变内容(如文档)在跨请求时保持不变,但会前置不同前缀。位置无关缓存(PIC)通过实现KV向量的模块化复用(不受前缀影响)解决了该问题。我们系统化阐述了PIC概念,并通过引入EPIC系统推进了现有研究。EPIC整合了我们提出的LegoLink算法,该算法通过消除每个文档起始处不恰当的'注意力汇聚'效应,以最小计算代价保持准确性。实验表明,相较于现有系统,EPIC在首令牌响应时间(TTFT)上实现最高8倍提升,吞吐量获得7倍增益,且准确率损失可忽略或为零。
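
The gap between prefix-based caching and position-independent caching can be made concrete with a small Python sketch. It only counts how many tokens could be reused under each policy; it does not model KV vectors or the LegoLink algorithm, and the function names and the representation of a request as a list of token tuples are assumptions made for illustration.

```python
from typing import List, Tuple

Chunk = Tuple[str, ...]  # one block of tokens, e.g. a system prompt or a document


def reusable_tokens_prefix(cached: List[Chunk], new: List[Chunk]) -> int:
    """Tokens reusable when the cache requires an exact prefix match."""
    reused = 0
    for old_chunk, new_chunk in zip(cached, new):
        if old_chunk != new_chunk:
            break  # the first mismatch invalidates everything after it
        reused += len(new_chunk)
    return reused


def reusable_tokens_position_independent(cached: List[Chunk], new: List[Chunk]) -> int:
    """Tokens reusable when any previously seen chunk can be reused at any position."""
    seen = set(cached)
    return sum(len(chunk) for chunk in new if chunk in seen)


document = ("d1", "d2", "d3")
request_a = [("sys", "question-a"), document]
request_b = [("sys", "question-b"), document]

print(reusable_tokens_prefix(request_a, request_b))                # 0
print(reusable_tokens_position_independent(request_a, request_b))  # 3
```

The two requests share the same document but differ in what precedes it, so the prefix policy reuses nothing while the position-independent policy reuses the whole document. Per the abstract, naive reuse of this kind affects accuracy, which is the problem LegoLink is meant to address.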


Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing

Abstract

arXiv:2410.06638v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs' full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.

摘要

大语言模型(LLMs)已展现出强大的数学推理能力,能够处理从基础算术到竞赛级难题的各类任务。然而,频繁出现的细微但关键的错误(如计算失误或替换错误)限制了其潜力。现有提升数学能力的研究通常通过对逐步解答样本对进行偏好学习来实现。尽管这些方法利用不同粒度的样本来减少推理错误,却忽视了关键的细微错误。本研究提出了一种名为"错误注入自编辑"(RISE)的新型偏好学习框架,通过将预定义的细微错误注入推理或计算步骤的关键标记中,构建用于错误缓解的困难样本对。具体而言,RISE利用大语言模型自身对解答中的少量标记进行编辑,注入设计的细微错误。随后,将自编辑解答与对应正确解答构成的样本对,以及通过采样获得的正确与错误解答对共同用于细微错误感知的DPO训练。与其他偏好学习方法相比,RISE进一步优化了训练目标,且无需细粒度采样或偏好标注。大量实验验证了RISE的有效性:在Qwen2-7B-Instruct模型上仅使用4.5K训练样本进行偏好学习,即可在GSM8K和MATH数据集上分别实现3.0%和7.9%的显著提升。此外,错误缓解的效果可延伸至逻辑推理和代码生成领域。
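
For readers unfamiliar with DPO, the pairwise objective that RISE builds on can be sketched as follows. This is the standard DPO loss for a single (chosen, rejected) pair, not the subtle-error-aware variant proposed in the paper, whose exact form the abstract does not give. The function name, the `beta` default, and the use of summed sequence log-probabilities as inputs are assumptions.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])),
    where each term is the log-probability of a full solution.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In the setting the abstract describes, the "chosen" solution would be the correct one and the "rejected" solution the self-edited copy with an injected error, so the two sequences differ in only a few tokens.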


RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Abstract

arXiv:2411.15114v2 Announce Type: replace-cross Abstract: Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g. an agent wrote a faster custom Triton kernel than any of our human experts' -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.

摘要

前沿AI安全政策强调,AI代理自动化AI研发(R&D)是需要预见的重要能力。然而,目前对AI研发能力的评估较少,且缺乏高度真实性并与人类表现直接对比的研究。我们提出RE-Bench(研究工程基准测试v1),包含7个具有挑战性的开放式机器学习研究工程环境,以及61位不同人类专家进行的71次8小时尝试数据。我们证实,专家在8小时内能在这些环境中取得进展——82%的专家尝试获得非零分数,24%达到或超过我们的强参考解决方案。通过不同时间预算和代理设计的k次最佳选择,我们将人类与多个前沿公共模型进行对比,发现当双方在每个环境总预算为2小时时,最佳AI代理的得分是人类的4倍。但人类目前展现出更好的时间预算回报率——在8小时预算下略微超过顶级AI代理得分,在32小时总预算(跨不同尝试)下达到顶级AI代理得分的2倍。定性分析表明,现代AI代理在诸多机器学习领域具备显著专长(例如某代理编写的定制Triton内核比所有人类专家更快),且能以十倍于人类的速度生成和测试解决方案,成本更低。我们开源了评估环境、专家数据、分析代码和代理轨迹以促进未来研究。
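
The best-of-k comparison mentioned in this abstract can be illustrated with a short Python sketch. It assumes a simple reading in which a total time budget is split into k independent attempts and the best score is kept; the abstract does not specify the benchmark's exact aggregation protocol, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def best_of_k(scores: Sequence[float]) -> float:
    """Score credited to a group of attempts: the best single attempt."""
    return max(scores)


def expected_best_of_k(scores: Sequence[float], k: int) -> float:
    """Average best-of-k score over every size-k subset of the observed attempts."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of attempts")
    return mean(best_of_k(subset) for subset in combinations(scores, k))
```

Under this reading, a 2-hour budget could be spent as one 2-hour attempt or as four 30-minute attempts, which is one way an agent that iterates quickly might benefit at short budgets.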


ProgCo: Program Helps Self-Correction of Large Language Models

Abstract

arXiv:2501.01264v2 Announce Type: replace-cross Abstract: Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, which further misleads refinement and leads to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and that it can further enhance performance when combined with real program tools. We release our code at https://github.com/songxiaoshuai/progco.

摘要

自我校正旨在使大型语言模型(LLMs)能够在无需外部反馈的情况下,对初始响应进行自我验证与自我优化。然而,LLMs往往难以有效执行自我验证并生成正确反馈,进而误导优化过程导致校正失败,尤其在复杂推理任务中表现显著。本文提出程序驱动式自我校正框架(ProgCo):首先,程序驱动验证(ProgVe)通过自生成、自执行的验证伪程序实现复杂验证逻辑与广泛验证;随后,程序驱动优化(ProgRe)接收ProgVe反馈,对响应和验证程序进行双重反思与优化,以减轻复杂推理任务中错误反馈的误导性。在三个指令跟随与数学基准测试上的实验表明,ProgCo能实现有效自我校正,且结合真实程序工具可进一步提升性能。代码已发布于https://github.com/songxiaoshuai/progco。


Tuning LLM Judge Design Decisions for 1/1000 of the Cost

Abstract

arXiv:2501.17178v4 Announce Type: replace-cross Abstract: Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which makes it possible to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at this repository: https://github.com/geoalgo/judgetuning.

摘要

评估大型语言模型(LLMs)通常需要昂贵的人工标注。为解决这一问题,研究者提出了基于LLM的评判器,通过比较两个LLM的输出实现无需人工干预的模型排序。尽管已有多种方法被提出,但不同研究之间存在诸多混杂因素。例如模型、提示词及其他超参数通常同时变更,导致难以进行直接对比。本文提出系统分析和调优LLM评判器超参数的方法。为降低评判器评估的高成本,我们采用多目标多保真度优化技术,可找到在准确性与成本间取得平衡的评判器,并显著降低搜索成本。我们的方法不仅发现了在准确性和成本效益上超越现有基准的评判器,还采用开放权重模型,确保了更高的可获取性和可复现性。实验代码详见https://github.com/geoalgo/judgetuning。
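
The accuracy-versus-cost trade-off in this abstract is a multi-objective problem, and its basic building block is the Pareto front. The sketch below shows that building block only; it is not the multi-fidelity search used in the paper, and the function name, the configuration names, and the numbers are made up for illustration.

```python
from typing import List, Tuple

JudgeConfig = Tuple[str, float, float]  # (name, accuracy, cost); higher accuracy and lower cost are better


def pareto_front(configs: List[JudgeConfig]) -> List[str]:
    """Return names of configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as accurate and
    at least as cheap, and strictly better on at least one of the two.
    """
    front = []
    for name, acc, cost in configs:
        dominated = any(
            other_acc >= acc and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for _, other_acc, other_cost in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical judge configurations, for illustration only.
candidates = [("A", 0.80, 10.0), ("B", 0.70, 12.0), ("C", 0.60, 2.0), ("D", 0.80, 5.0)]
print(pareto_front(candidates))  # ['C', 'D']
```

Here "A" is dropped because "D" matches its accuracy at half the cost, while "C" survives because nothing is cheaper.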


Abstract

arXiv:2501.18922v2 Announce Type: replace-cross Abstract: Knowledge Base Question Answering (KBQA) aims to answer natural language questions with a large-scale structured knowledge base (KB). Despite advancements with large language models (LLMs), KBQA still faces challenges in weak KB awareness, imbalance between effectiveness and efficiency, and high reliance on annotated data. To address these challenges, we propose KBQA-o1, a novel agentic KBQA method with Monte Carlo Tree Search (MCTS). It introduces a ReAct-based agent process for stepwise logical form generation with KB environment exploration. Moreover, it employs MCTS, a heuristic search method driven by policy and reward models, to balance agentic exploration's performance and search space. With heuristic exploration, KBQA-o1 generates high-quality annotations for further improvement by incremental fine-tuning. Experimental results show that KBQA-o1 outperforms previous low-resource KBQA methods with limited annotated data, boosting the Llama-3.1-8B model's GrailQA F1 performance to 78.5%, compared to 48.5% for the previous state-of-the-art method with GPT-3.5-turbo. Our code is publicly available.

摘要

知识库问答(KBQA)旨在利用大规模结构化知识库(KB)回答自然语言问题。尽管大语言模型(LLM)取得了进展,KBQA仍面临知识库感知能力弱、效果与效率失衡、对标注数据依赖性强等挑战。为解决这些问题,我们提出KBQA-o1——一种结合蒙特卡洛树搜索(MCTS)的新型智能体KBQA方法。该方法引入基于ReAct的智能体流程,通过知识库环境探索逐步生成逻辑形式;同时采用由策略模型和奖励模型驱动的启发式搜索方法MCTS,以平衡智能体探索的性能与搜索空间。通过启发式探索,KBQA-o1能生成高质量标注数据用于增量微调改进。实验表明,在标注数据有限的情况下,KBQA-o1优于现有低资源KBQA方法:Llama-3.1-8B模型在GrailQA上的F1值达到78.5%,较之前采用GPT-3.5-turbo的SOTA方法(48.5%)显著提升。代码已开源。
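
As background for the MCTS component, the selection step of a textbook search can be sketched in Python. This is the generic UCT rule, not the policy- and reward-model-driven variant the abstract attributes to KBQA-o1; the function names and the exploration constant are assumptions.

```python
import math
from typing import List, Tuple


def uct_score(total_reward: float, visits: int, parent_visits: int,
              exploration: float = 1.414) -> float:
    """Textbook UCT value: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = total_reward / visits
    explore = exploration * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Tuple[float, int]]) -> int:
    """Pick the index of the child (total_reward, visits) with the highest UCT value."""
    parent_visits = sum(visits for _, visits in children)
    scores = [uct_score(reward, visits, parent_visits) for reward, visits in children]
    return max(range(len(children)), key=scores.__getitem__)
```

In an agentic KBQA setting, each child would correspond to a candidate next step in building the logical form, and the bonus term keeps the search from committing too early to one partial query.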


More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives

Abstract

arXiv:2501.04070v3 Announce Type: replace-cross Abstract: Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated and Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset hoping to facilitate further research in many-shot ICL: https://github.com/xiaoqzhwhu/DrICL

摘要

大型语言模型(LLMs)在无需参数更新的情况下,能够出色地完成少样本上下文学习(ICL)。然而,当ICL演示样本从少量增加到大量时,模型性能往往趋于稳定并最终下降。我们发现这一趋势主要由两个原因导致:次优的负对数似然(NLL)优化目标和递增的数据噪声。为解决这些问题,我们提出了一种新型优化方法DrICL,该方法通过差异化(Differentiated)和重加权(Reweighting)目标来提升模型性能。在全局层面,DrICL采用差异化学习优化NLL目标,确保多样本性能超越零样本水平;在局部层面,该方法受强化学习累积优势启发,动态调整多样本演示的权重,从而降低噪声数据的影响。针对现有数据集中缺乏多样本分布多任务数据的问题,我们构建了多样本ICL基准测试集(ICL-50),这是一个包含50个任务的大规模基准数据集,覆盖1至350个样本量级,序列长度最高达8,000个标记,可用于模型微调和评估。实验结果表明,经DrICL增强的LLMs在各类任务(包括领域内和领域外场景)的多样本设置中均取得显著性能提升。我们公开了代码和数据集以促进多样本ICL的后续研究:https://github.com/xiaoqzhwhu/DrICL


Unraveling Indirect In-Context Learning Using Influence Functions

Abstract

arXiv:2501.01473v2 Announce Type: replace-cross Abstract: In this work, we introduce a novel paradigm for generalized In-Context Learning (ICL), termed Indirect In-Context Learning. In Indirect ICL, we explore demonstration selection strategies tailored for two distinct real-world scenarios: Mixture of Tasks and Noisy ICL. We systematically evaluate the effectiveness of Influence Functions (IFs) as a selection tool for these settings, highlighting the potential of IFs to better capture the informativeness of examples within the demonstration pool. For the Mixture of Tasks setting, demonstrations are drawn from 28 diverse tasks, including MMLU, BigBench, StrategyQA, and CommonsenseQA. We demonstrate that combining BertScore-Recall (BSR) with an IF surrogate model can further improve performance, leading to average absolute accuracy gains of 0.37% and 1.45% for 3-shot and 5-shot setups when compared to traditional ICL metrics. In the Noisy ICL setting, we examine scenarios where demonstrations might be mislabeled or have adversarial noise. Our experiments show that reweighting traditional ICL selectors (BSR and Cosine Similarity) with IF-based selectors boosts accuracy by an average of 2.90% for Cosine Similarity and 2.94% for BSR on noisy GLUE benchmarks. For the adversarial sub-setting, we show the utility of using IFs for task-agnostic demonstration selection for backdoor attack mitigation, with a 32.89% reduction in Attack Success Rate compared to task-aware methods. In sum, we propose a robust framework for demonstration selection that generalizes beyond traditional ICL, offering valuable insights into the role of IFs for Indirect ICL.

摘要

在本研究中,我们提出了一种新型的广义上下文学习范式——间接上下文学习。针对混合任务和噪声上下文学习这两种现实场景,我们系统性地开发了相应的示例选择策略。通过重点研究影响函数作为选择工具的效能,我们发现影响函数能更有效地捕捉示例池中样本的信息价值。在混合任务场景中,我们从28项多样化任务(包括MMLU、BigBench、StrategyQA和CommonsenseQA)选取示例。实验表明,将BertScore-Recall与影响函数代理模型相结合可进一步提升性能:相较于传统上下文学习指标,在3样本和5样本设置下分别实现0.37%和1.45%的平均绝对准确率提升。针对噪声上下文学习场景,我们研究了示例误标记和对抗性噪声的情况。实验证明,基于影响函数的权重调整策略能使传统选择器(余弦相似度和BertScore-Recall)在噪声GLUE基准上的准确率平均提升2.90%(余弦相似度)和2.94%(BertScore-Recall)。在对抗性子集实验中,我们发现影响函数可实现与任务无关的示例选择,有效降低后门攻击成功率——较任务感知方法实现32.89%的攻击成功率降幅。本研究最终构建了一个鲁棒的示例选择框架,其通用性超越传统上下文学习方法,并为影响函数在间接上下文学习中的作用提供了重要见解。


TransMLA: Migrating GQA Models to MLA with Full DeepSeek Compatibility and Speedup

Abstract

arXiv:2502.07864v3 Announce Type: replace-cross Abstract: In this paper, we present TransMLA, a framework that seamlessly converts any GQA-based pre-trained model into an MLA-based model. Our approach enables direct compatibility with DeepSeek's codebase, allowing these models to fully leverage DeepSeek-specific optimizations such as vLLM and SGlang. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length while preserving meaningful output quality. Additionally, the model requires only 6 billion tokens for fine-tuning to regain performance on par with the original across multiple benchmarks. TransMLA offers a practical solution for migrating GQA-based models to the MLA structure. When combined with DeepSeek's advanced features, such as FP8 quantization and Multi-Token Prediction, even greater inference acceleration can be realized.

摘要

本文提出TransMLA框架,该框架能够将任何基于GQA的预训练模型无缝转换为基于MLA的模型。我们的方法实现了与DeepSeek代码库的直接兼容,使这些模型能够充分利用DeepSeek特有的优化技术,如vLLM和SGlang。通过在LLaMA-2-7B模型中压缩93%的KV缓存,TransMLA在8K上下文长度下实现了10.6倍的推理加速,同时保持了有意义的输出质量。此外,该模型仅需60亿标记进行微调即可在多个基准测试中恢复至原始模型性能水平。TransMLA为将基于GQA的模型迁移至MLA结构提供了实用解决方案。当结合DeepSeek的高级特性(如FP8量化和多标记预测)时,还能实现更显著的推理加速。


Training a Generally Curious Agent

Abstract

arXiv:2502.17543v3 Announce Type: replace-cross Abstract: Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present Paprika, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task based on environment feedback in-context without more gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data instead of model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interactions with the external world.

摘要

高效探索对于与环境交互的智能系统至关重要,但现有语言模型在需要策略性信息收集的场景中往往表现欠佳。本文提出Paprika——一种微调方法,使语言模型能够发展不局限于特定环境的通用决策能力。通过训练来自不同任务(需要多样化策略)的合成交互数据,Paprika教会模型基于上下文环境反馈在新任务中探索并调整行为,而无需进一步梯度更新。实验结果表明,经Paprika微调的模型能有效将习得的决策能力迁移至完全未见过的任务,且无需额外训练。与传统训练不同,本方法的主要瓶颈在于采样有用交互数据而非模型更新。为提高样本效率,我们提出课程学习策略,优先从具有高学习潜力的任务中采样轨迹。这些结果为开发能自主解决需要与外部世界交互的新型序列决策问题的人工智能系统指明了一条可行路径。


Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text

Abstract

arXiv:2502.12953v2 Announce Type: replace-cross Abstract: Masked language modeling has become a widely adopted unsupervised technique to pre-train large language models (LLMs). However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.

摘要

掩码语言建模已成为预训练大语言模型(LLM)时广泛采用的无监督技术。然而,现有方法中掩码标记的选择过程是随机的,且掩码比例通常在训练全程固定不变。本文提出一种基于任务信息反课程学习的新方案,用于动态调整掩码比例并决策掩码标记的选择。首先,我们利用任务相关知识区分关键标记与干扰标记以指导掩码决策;其次,设计周期性衰减的掩码比例,对应从难到易的反课程调度策略。我们通过情感分析、主题文本分类和作者归属三个下游任务验证了所提出的任务信息反课程掩码(TIACBM)方法。实验表明,TIACBM能有效增强模型对任务关键特征的聚焦能力,在所有任务中均取得统计显著的性能提升。代码已发布于https://github.com/JarcaAndrei/TIACBM。
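
One possible shape for the cyclic decaying masking ratio is sketched below in Python. The abstract does not give the schedule's formula, so the linear decay, the reset at each cycle boundary, the function name, and the default ratios are all assumptions chosen only to show the "hard to easy" idea.

```python
def cyclic_decaying_mask_ratio(
    step: int,
    cycle_length: int,
    max_ratio: float = 0.3,
    min_ratio: float = 0.1,
) -> float:
    """Masking ratio that decays linearly within a cycle and resets at the next one.

    A high ratio at the start of each cycle hides more tokens (harder task);
    the ratio then falls toward min_ratio (easier task), which matches an
    anti-curriculum ordering from hard to easy.
    """
    if cycle_length <= 0:
        raise ValueError("cycle_length must be positive")
    position = (step % cycle_length) / cycle_length  # progress through the cycle, in [0, 1)
    return max_ratio - (max_ratio - min_ratio) * position
```

A training loop would call this once per step (or per epoch) and pass the result to whatever routine selects tokens to mask; in the paper that selection is itself guided by task-specific knowledge rather than being uniform at random.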


Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework

Abstract

arXiv:2502.18874v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.

摘要

大型语言模型(LLM)在各类场景中的自动化评估应用日益广泛。先前研究试图通过微调开源LLM来复现GPT-4等强效专有模型的评估解释与判断,但这些方法主要局限于预定义通用标准下的文本分析,导致对未见指令的适应性降低,且在评估量化与结构约束遵循性时表现出不稳定性。为解决这些局限,我们提出新型评估框架ARJudge,其能自适应制定评估标准,并综合文本与代码驱动分析来评估LLM响应。ARJudge包含两个组件:生成多维度评估分析的微调分析器(Analyzer),以及无需调优、通过整合优化所有分析做出最终判定的精炼器(Refiner)。我们构建了复合分析语料库(Composite Analysis Corpus),集成评估标准生成任务与文本/代码驱动分析生成任务以训练分析器。实验结果表明,ARJudge在效能与鲁棒性上优于现有微调评估器,同时验证了多维度评估与代码驱动分析对提升评估能力的重要性。


GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization

Abstract

arXiv:2502.13398v2 Announce Type: replace-cross Abstract: Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLMOs, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLMOs consistently outperform state-of-the-art baselines. GeLLMOs also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLMOs as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.

摘要

尽管近期取得进展,大多数分子优化的计算方法仍局限于单属性或双属性优化任务,且存在可扩展性差、对新优化任务泛化能力不足的问题。与此同时,大型语言模型(LLMs)在新任务上展现出卓越的跨领域泛化能力。为验证LLMs在分子优化中的潜力,我们提出了MuMOInstruct——首个专注于复杂多属性分子优化任务的高质量指令微调数据集。基于MuMOInstruct,我们开发了GeLLMOs系列指令微调LLMs用于分子优化。在5个领域内任务和5个跨领域任务上的广泛评估表明,GeLLMOs持续优于现有最先进基线模型。GeLLMOs对未见任务还展现出突出的零样本泛化能力,显著优于强大的闭源LLMs。这种强大的泛化能力证明了GeLLMOs作为分子优化基础模型的巨大潜力,从而无需资源密集的重新训练即可应对新型优化任务。MuMOInstruct数据集、模型及代码可通过https://github.com/ninglab/GeLLMO获取。


Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement

Abstract

arXiv:2502.17442v2 Announce Type: replace-cross Abstract: Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce ThinkCoder, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide LLM's evolution. This approach enhances LLM's exploration efficiency via preference learning, cutting costs while maintaining accuracy. ThinkCoder boosts the performance with a single LLM, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 3.0% over MapCoder with just 6.4% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.5% higher Pass@1 after 2 rounds, outperforming AgentCoder's 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20% of the computational resources. These results highlight the framework's effectiveness and scalability.

摘要

代码生成在软件工程中对实现高效自动化编码至关重要。尽管测试时计算方法展现出潜力,但由于需进行多轮计算,其存在高延迟问题。为此,我们提出ThinkCoder框架,该框架将全面探索与优化精炼相结合:探索阶段通过搜索潜在解决方案扩展解空间,精炼阶段则提升解决方案的精确度。这种方法使我们在采取行动前能通过审慎考量选择最优解,避免过度试错。为进一步降低测试时计算开销,我们引入基于强化自训练(ReST)的偏好驱动优化,利用ThinkCoder的探索轨迹指导大语言模型(LLM)的进化。该方法通过偏好学习提升LLM的探索效率,在保持准确性的同时降低成本。ThinkCoder仅用单一LLM即可提升性能,在HumanEval和MBPP等基准测试中表现卓越。相比最先进模型,其Pass@1指标较MapCoder提升3.0%,而计算成本仅需6.4%。与AgentCoder相比,ThinkCoder在2轮迭代后Pass@1高出0.5%,优于AgentCoder的5轮表现。此外,结合成功轨迹的ReST能提升效率,使LLaMA2-7B等模型仅用20%计算资源即可获得具有竞争力的结果。这些成果凸显了该框架的有效性与可扩展性。


No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models

Abstract

arXiv:2503.11985v2 Announce Type: replace-cross Abstract: Advancements in Large Language Models (LLMs) have increased the performance of different natural language understanding as well as generation tasks. Although LLMs have reached state-of-the-art performance in various tasks, they often reflect different forms of bias present in the training data. In light of this perceived limitation, we provide a unified evaluation of benchmarks using a set of representative small and medium-sized LLMs that cover different forms of biases, from physical characteristics to socio-economic categories. Moreover, we propose five prompting approaches to carry out the bias detection task across different aspects of bias. Further, we formulate three research questions to gain valuable insight into detecting biases in LLMs using different approaches and evaluation metrics across benchmarks. The results indicate that each of the selected LLMs suffers from one form of bias or another, with the Phi-3.5B model being the least biased. Finally, we conclude the paper with the identification of key challenges and possible future directions.

摘要

大语言模型(LLMs)的进展提升了各类自然语言理解与生成任务的性能。尽管LLMs在多项任务中实现了最先进的性能表现,但它们往往反映出训练数据中存在的不同形式偏见。针对这一局限性,我们使用一组具有代表性的中小型LLMs,对涵盖从身体特征到社会经济类别等不同偏见形式的基准测试进行了统一评估。此外,我们提出五种提示方法以执行跨不同偏见维度的检测任务,并构建三个研究问题,通过不同方法和评估指标在基准测试中获取关于LLMs偏见检测的重要洞见。结果表明,所选LLMs均存在某种形式的偏见,其中Phi-3.5B模型的偏见程度最低。最后,我们通过识别关键挑战和潜在未来研究方向对本文进行了总结。


Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases

Abstract

arXiv:2502.19249v2 Announce Type: replace-cross Abstract: Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model's performance on syntactic evaluations.

摘要

在形式语言上预训练语言模型可以提升其对自然语言的习得能力。究竟是形式语言的哪些特征能够产生有效的迁移归纳偏置?基于语言学和复杂性理论的洞见,我们提出假设:有效迁移需满足两个条件,即形式语言需捕捉自然语言中的依存结构,同时保持在模型架构的计算限度内。通过对Transformer模型进行预预训练(在自然语言之前先训练形式语言)的实验,我们发现具有层级依存特征的形式语言确实能使语言模型在自然语言上获得更低的损失值,并展现出比其他形式语言更好的语言泛化能力。对于形式语言应处于架构计算限度内的假设,我们也获得了适度支持。值得注意的是,预预训练比同等规模的自然语言训练能更高效地降低损失值。对于一个在约16亿自然语言标记上训练的10亿参数语言模型,预预训练能以减少33%的标记预算达到相同的损失值和更优的语言泛化表现。最后,我们还提供了形式语言向自然语言迁移的机制性证据:预预训练期间获得的注意力头对模型句法评估性能保持关键作用。


How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation

Abstract

arXiv:2503.09598v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) are widely deployed in diverse scenarios, the extent to which they could tacitly spread misinformation emerges as a critical safety concern. Current research primarily evaluates LLMs on explicit false statements, overlooking how misinformation often manifests subtly as unchallenged premises in real-world interactions. We curated EchoMist, the first comprehensive benchmark for implicit misinformation, where false assumptions are embedded in the query to LLMs. EchoMist targets circulated, harmful, and ever-evolving implicit misinformation from diverse sources, including realistic human-AI conversations and social media interactions. Through extensive empirical studies on 15 state-of-the-art LLMs, we find that current models perform alarmingly poorly on this task, often failing to detect false premises and generating counterfactual explanations. We also investigate two mitigation methods, i.e., Self-Alert and RAG, to enhance LLMs' capability to counter implicit misinformation. Our findings indicate that EchoMist remains a persistent challenge and underscore the critical need to safeguard against the risk of implicit misinformation.

摘要

随着大型语言模型(LLMs)在多样化场景中的广泛应用,其潜在传播错误信息的隐性风险已成为关键的安全问题。现有研究主要评估模型对显式错误陈述的识别能力,却忽视了现实交互中错误信息常以未被质疑的前提假设这一微妙形式存在。我们构建了首个针对隐性错误信息的综合基准EchoMist,通过将错误假设嵌入查询语句来测试LLMs。该基准涵盖来自人机对话和社交媒体互动等多源场景中传播性强、危害性高且持续演变的隐性错误信息。通过对15个前沿LLMs的实证研究,发现当前模型在此任务上表现堪忧:多数无法识别错误前提并生成违背事实的解释。我们进一步探究了两种缓解方法(自我警示机制和检索增强生成技术)以提升模型应对隐性错误信息的能力。研究表明EchoMist仍构成持续性挑战,凸显了防范隐性错误信息风险的紧迫性。


MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving

Abstract

arXiv:2503.03205v3 Announce Type: replace-cross Abstract: Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proofs or perform tree search, but they fail to balance these tasks. We propose MA-LoT: Model-CollAboration Lean-based Long Chain-of-Thought, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel LoT-Transfer Learning training-inference pipeline, which gives LLMs the Long CoT thinking capability without special data annotation. Extensive experiments show that our framework achieves a 61.07% accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.

摘要

使用Lean等计算机可验证语言解决数学问题已对数学和计算机科学界产生重大影响。现有先进方法采用单一大型语言模型(LLM)生成完整证明或执行树搜索,但未能平衡这些任务。我们提出MA-LoT框架:基于模型协作与Lean的长链思维,这是一个用于Lean4定理证明的综合框架,通过模型协作方法将整体证明生成的通用自然语言认知任务与纠错分析任务解耦。该框架通过LLM与Lean4验证器在长链思维中的结构化交互实现目标。为实现此框架,我们提出创新的LoT迁移学习训练-推理流程,无需特殊数据标注即可使LLM具备长链思维能力。大量实验表明,本框架在MiniF2F-Test数据集的Lean4版本上达到61.07%准确率,显著优于DeepSeek-V3(33.61%)、单模型树搜索(InternLM-Step-Prover,50.70%)和整体证明生成(Godel-Prover,55.33%)基线。此外,我们的研究揭示了长链思维与形式化验证相结合在更广阔领域实现更具洞察力生成的潜力。


ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large language Models

Abstract

arXiv:2503.13107v2 Announce Type: replace-cross Abstract: Contrastive decoding strategies are widely used to mitigate object hallucinations in multimodal large language models (MLLMs). By reducing over-reliance on language priors, these strategies ensure that generated content remains closely grounded in visual inputs, producing contextually accurate outputs. Since contrastive decoding requires no additional training or external tools, it offers both computational efficiency and versatility, making it highly attractive. However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model's middle layers, where modality fusion predominantly occurs. This approach enables more effective capture of visual features, reducing the model's bias toward language modality. Experimental results demonstrate that VAF significantly reduces hallucinations across various MLLMs without affecting inference speed, while maintaining coherence and accuracy in generated outputs.

摘要

对比解码策略被广泛应用于缓解多模态大语言模型(MLLMs)中的物体幻觉问题。通过降低对语言先验的过度依赖,这些策略确保生成内容紧密基于视觉输入,从而产生符合上下文的准确输出。由于对比解码无需额外训练或外部工具,其兼具计算高效性和多功能性,因而极具吸引力。然而,此类方法存在两大局限:(1)粗暴抑制语言先验会损害生成内容的连贯性与准确性;(2)处理对比输入会增加计算负荷,显著降低推理速度。为解决这些问题,我们提出视觉放大融合(VAF)技术——一种即插即用方案,通过在模型中间层(模态融合主要发生层)增强对视觉信号的关注,从而更有效地捕捉视觉特征,减少模型对语言模态的偏向。实验结果表明,VAF能在不影响推理速度的前提下,显著降低各类MLLMs的幻觉现象,同时保持生成内容的连贯性与准确性。
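To make the idea of enhancing attention to visual signals concrete, here is a minimal sketch for a single attention row. The abstract does not specify how VAF modifies attention, so the code substitutes a simple stand-in: an additive bias on the pre-softmax logits of visual-token positions. The `boost` value, the token layout, and all function names are assumptions for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def amplified_attention(logits, is_visual, boost=1.0):
    """Add `boost` to the logits of visual tokens, then renormalize (assumed mechanism)."""
    adjusted = [logit + boost if visual else logit
                for logit, visual in zip(logits, is_visual)]
    return softmax(adjusted)


def visual_mass(weights, is_visual):
    """Total attention weight assigned to visual tokens."""
    return sum(w for w, visual in zip(weights, is_visual) if visual)


# One query's attention logits over four keys; positions 0 and 2 are image tokens.
logits = [0.2, 1.5, 0.1, 1.2]
is_visual = [True, False, True, False]
baseline = softmax(logits)
amplified = amplified_attention(logits, is_visual, boost=1.0)
```

In this sketch the adjustment happens inside the existing forward pass, so no second, contrastive pass is needed. According to the abstract, that second pass is what slows contrastive decoding.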


LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Abstract

arXiv:2503.14434v2 Announce Type: replace-cross Abstract: Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.

摘要

自动化特征工程在提升表格学习任务预测模型性能方面具有关键作用。传统自动化特征工程方法受限于其依赖固定人工设计搜索空间中的预定义变换,往往忽视领域知识。近期利用大语言模型(LLMs)的进展使得领域知识能够融入特征工程过程。然而,现有基于LLM的方法要么使用直接提示,要么仅依赖验证分数进行特征选择,未能利用先前特征发现实验的洞见,或在特征生成与数据驱动性能间建立有效推理。为解决这些问题,我们提出LLM-FE框架,该框架将进化搜索与LLMs的领域知识和推理能力相结合,自动发现适用于表格学习任务的有效特征。LLM-FE将特征工程建模为程序搜索问题,通过LLMs迭代提出新特征变换程序,并由数据驱动的反馈引导搜索过程。实验结果表明,LLM-FE在多种分类和回归基准测试中持续优于现有最先进基线方法,显著提升了表格预测模型的性能。
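The search loop described above (a proposer suggests feature programs, and data-driven scores feed back into the next proposal) can be sketched in a few lines. Everything here is a simplified assumption: the LLM is replaced by a hypothetical stub `propose_program` that walks through a fixed candidate list, the score is the absolute Pearson correlation with the target rather than a validation metric from a trained model, and the dataset is synthetic.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


# Candidate feature programs the stub can "generate" (hypothetical search space).
CANDIDATE_PROGRAMS = {
    "a + b": lambda row: row["a"] + row["b"],
    "a - b": lambda row: row["a"] - row["b"],
    "a * b": lambda row: row["a"] * row["b"],
    "a / (b + 1)": lambda row: row["a"] / (row["b"] + 1),
}


def propose_program(scored_population):
    """Stand-in for the LLM call: return the first candidate not yet scored, else None.

    A real proposer would receive the scored population in its prompt as feedback.
    """
    tried = {name for name, _ in scored_population}
    for name in CANDIDATE_PROGRAMS:
        if name not in tried:
            return name
    return None


def feature_search(rows, target, max_iterations=10):
    """Iteratively propose, score, and rank feature programs (best first)."""
    population = []
    for _ in range(max_iterations):
        name = propose_program(population)
        if name is None:
            break
        feature = [CANDIDATE_PROGRAMS[name](row) for row in rows]
        score = abs(pearson(feature, target))
        population.append((name, score))
        population.sort(key=lambda item: item[1], reverse=True)
    return population


# Synthetic table whose target depends on the interaction a * b.
rows = [{"a": a, "b": b} for a in range(1, 6) for b in range(1, 6)]
target = [row["a"] * row["b"] for row in rows]
population = feature_search(rows, target)
best_name, best_score = population[0]
```

On this toy data the interaction feature `a * b` ranks first, since it reproduces the target exactly.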


Achieving binary weight and activation for LLMs using Post-Training Quantization

Abstract

arXiv:2504.05352v2 Announce Type: replace-cross Abstract: Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grain grouping and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we decompose INT4-quantized activations into a 4 * INT1 format equivalently and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.

摘要

将大型语言模型(LLMs)量化至1比特精度可显著降低计算成本,但现有量化技术在权重和激活精度低于4比特(W4A4)时会出现明显性能下降。本文提出一种采用W(1+1)A(1*4)配置的训练后量化框架:权重被量化为1比特并额外增加1比特用于细粒度分组,激活被量化为1比特同时通道数扩展4倍。针对权重量化,我们提出基于Hessian感知的细粒度分组方法及EM驱动的量化方案。对于激活量化,我们将INT4量化的激活等效分解为4 * INT1格式,并基于量化误差同步平滑缩放因子,从而进一步降低激活量化误差。在W2A4配置下,本方法在多项任务中超越了最先进的LLM量化基线,将现有LLM量化方法的边界推向完全二值化模型。代码发布于https://github.com/JimmyCrave/LLM-PTQ-binarization。
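The equivalence between INT4 activations and a 4 * INT1 format rests on plain binary expansion: an unsigned 4-bit value x equals the sum over k of 2^k * b_k, so a dot product with x splits into four dot products with binary vectors. The sketch below checks this on a toy example. It assumes unsigned INT4 activations and {-1, +1} weights, and it leaves out the scaling factors and the smoothing step mentioned in the abstract. The function names are hypothetical.

```python
def bit_planes(x_int4):
    """Split unsigned 4-bit activations into four binary vectors, least significant bit first."""
    return [[(x >> k) & 1 for x in x_int4] for k in range(4)]


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dot_via_bit_planes(x_int4, w):
    """Compute x . w as sum_k 2^k * (b_k . w), touching only binary activations."""
    return sum((1 << k) * dot(plane, w)
               for k, plane in enumerate(bit_planes(x_int4)))


x = [0, 5, 15, 9]      # unsigned INT4 activations, each in [0, 15]
w = [1, -1, 1, -1]     # binary weights in {-1, +1}
direct = dot(x, w)
decomposed = dot_via_bit_planes(x, w)
```

Both paths give the same result (1 for this input). Concatenating the four bit planes along the channel axis, with the 2^k factors folded into the scales, is one way to read the 4-fold increase in channel count that the abstract mentions.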


H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding

Abstract

arXiv:2503.24008v2 Announce Type: replace-cross Abstract: With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models' comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark contributes three key features: (1) Extended video duration: spanning videos from brief 3-second clips to comprehensive 1.5-hour recordings, thereby bridging the temporal gaps found in current benchmarks. (2) Comprehensive assessment tasks: beyond traditional perceptual and reasoning tasks, we have introduced modules for countercommonsense comprehension and trajectory state tracking. These additions test the models' deep understanding capabilities beyond mere prior knowledge. (3) Enriched video data: to keep pace with the rapid evolution of current AI agents, we have expanded first-person streaming video datasets. This expansion allows for the exploration of multimodal models' performance in understanding streaming videos from a first-person perspective. Extensive results from H2VU reveal that existing multimodal large language models (MLLMs) possess substantial potential for improvement in our newly proposed evaluation tasks. We expect that H2VU will facilitate advancements in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.

摘要

随着多模态模型的快速发展,对视频理解能力评估的需求持续增长。然而现有视频理解评估基准在覆盖范围、任务多样性和场景适应性方面存在显著局限,这些缺陷阻碍了对模型综合视频理解能力的准确评估。为解决这一挑战,我们提出分层式整体视频理解(H2VU)基准,旨在评估通用视频和在线流媒体视频理解能力。该基准具备三项关键特征:扩展的视频时长——涵盖从3秒短视频到1.5小时长视频的跨度,弥补现有基准的时间维度空白;全面的评估任务——除传统感知与推理任务外,新增反常识理解与轨迹状态追踪模块,测试模型超越先验知识的深层理解能力;丰富的视频数据——为匹配当前AI智能体的快速发展,我们扩充了第一人称流媒体视频数据集,以探索多模态模型在第一视角流媒体视频理解中的表现。H2VU的大规模实验结果表明,现有多模态大语言模型(MLLMs)在我们提出的新评估任务中具有显著改进潜力。我们期待H2VU基准能通过对MLLMs进行全面深入的分析,推动视频理解研究的发展。


Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge

Abstract

arXiv:2502.13010v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have significantly advanced medical question-answering by leveraging extensive clinical data and medical literature. However, the rapid evolution of medical knowledge and the labor-intensive process of manually updating domain-specific resources pose challenges to the reliability of these systems. To address this, we introduce Agentic Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates the construction and continuous updating of medical knowledge graphs, integrates reasoning, and retrieves current external evidence, such as PubMed and WikiSearch. By dynamically linking new findings and complex medical concepts, AMG-RAG not only improves accuracy but also enhances interpretability in medical queries. Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of 66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to 100 times larger. Notably, these improvements are achieved without increasing computational overhead, highlighting the critical role of automated knowledge graph generation and external evidence retrieval in delivering up-to-date, trustworthy medical insights.

摘要

大型语言模型(LLMs)通过利用大量临床数据和医学文献,显著推进了医疗问答系统的发展。然而,医学知识的快速演进与人工更新领域专用资源的高成本流程,对这些系统的可靠性提出了挑战。为此,我们提出Agentic医学图谱检索增强生成框架(AMG-RAG),该框架能自动化构建并持续更新医学知识图谱,整合推理过程,同时检索最新外部证据(如PubMed和WikiSearch)。通过动态关联新发现与复杂医学概念,AMG-RAG不仅提升了医疗查询的准确性,还增强了结果的可解释性。在MEDQA和MEDMCQA基准测试中,AMG-RAG分别取得74.1%的F1分数和66.34%的准确率,优于同类模型及参数量10至100倍的模型。值得注意的是,这些改进并未增加计算开销,凸显了自动化知识图谱构建与外部证据检索在提供最新、可信医疗见解中的关键作用。